{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "gpuType": "V100",
      "machine_shape": "hm"
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Machine Learning Compilation for Beginners\n",
        "\n",
        "## Tutorial Overview\n",
        "\n",
        "Demo: Dr Dignity (Local Models are Private, Data Compliant, &Free (No OpenAI)\n",
        "1. Build a Simple Language Model in Python!\n",
        "2. Accelerate on NVIDIA GPUs using Compiler #1: CUDA\n",
        "3. Accelerate on iOS GPUs using Compiler #2: Metal\n",
        "4. Accelerate on Android GPUs using Compiler #3: Vulkan\n",
        "5. Accelerate on iOS & Android GPUs using Compiler #4: Tensor Virtual Machine\n",
        "6. Compile a Large Language Model from HuggingFace to iOS & Android with TVM\n",
        "7. Future Directions\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "5LJhJCt7KHRr"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Dependencies\n",
        "\n",
        "- Python for Programming\n",
        "- Numpy for CPU speed up\n",
        "- CUDA for Nvidia GPU speed up\n",
        "- Metal for iOS GPU speed up\n",
        "- Vulkan for Android GPU speed up\n",
        "- Tensor Virtual Machine for iOS and Android GPU speed up\n",
        "- Relay to Optimize for iOS and Android\n",
        "- MLC-LLM for LLM-specific Compilation Tools\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "2IhKZxxCGjb2"
      }
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 741
        },
        "id": "h8eCPpGoJm1p",
        "outputId": "c08e1312-960f-4924-9368-448c467d0f10"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in links: https://mlc.ai/wheels\n",
            "Collecting mlc_ai_nightly\n",
            "  Using cached https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly-0.12.dev1576-cp310-cp310-manylinux_2_28_x86_64.whl (86.3 MB)\n",
            "Collecting attrs (from mlc_ai_nightly)\n",
            "  Using cached attrs-23.1.0-py3-none-any.whl (61 kB)\n",
            "Collecting cloudpickle (from mlc_ai_nightly)\n",
            "  Using cached cloudpickle-2.2.1-py3-none-any.whl (25 kB)\n",
            "Collecting decorator (from mlc_ai_nightly)\n",
            "  Using cached decorator-5.1.1-py3-none-any.whl (9.1 kB)\n",
            "Collecting ml-dtypes (from mlc_ai_nightly)\n",
            "  Using cached ml_dtypes-0.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)\n",
            "Collecting numpy (from mlc_ai_nightly)\n",
            "  Using cached numpy-1.26.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)\n",
            "Collecting psutil (from mlc_ai_nightly)\n",
            "  Using cached psutil-5.9.5-cp36-abi3-manylinux_2_12_x86_64.manylinux2010_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (282 kB)\n",
            "Collecting scipy (from mlc_ai_nightly)\n",
            "  Using cached scipy-1.11.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (36.3 MB)\n",
            "Collecting tornado (from mlc_ai_nightly)\n",
            "  Using cached tornado-6.3.3-cp38-abi3-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (427 kB)\n",
            "Collecting typing-extensions (from mlc_ai_nightly)\n",
            "  Using cached typing_extensions-4.8.0-py3-none-any.whl (31 kB)\n",
            "Installing collected packages: typing-extensions, tornado, psutil, numpy, decorator, cloudpickle, attrs, scipy, ml-dtypes, mlc_ai_nightly\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "ipython 7.34.0 requires jedi>=0.16, which is not installed.\n",
            "cupy-cuda11x 11.0.0 requires numpy<1.26,>=1.20, but you have numpy 1.26.0 which is incompatible.\n",
            "google-colab 1.0.0 requires tornado==6.3.2, but you have tornado 6.3.3 which is incompatible.\n",
            "moviepy 1.0.3 requires decorator<5.0,>=4.0.2, but you have decorator 5.1.1 which is incompatible.\n",
            "numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.26.0 which is incompatible.\n",
            "tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 1.26.0 which is incompatible.\n",
            "tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.8.0 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed attrs-23.1.0 cloudpickle-2.2.1 decorator-4.4.2 ml-dtypes-0.2.0 mlc_ai_nightly-0.12.dev1576 numpy-1.23.5 psutil-5.9.5 scipy-1.11.2 tornado-6.3.2 typing-extensions-4.5.0\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.colab-display-data+json": {
              "pip_warning": {
                "packages": [
                  "decorator",
                  "numpy",
                  "psutil",
                  "tornado",
                  "tvm"
                ]
              }
            }
          },
          "metadata": {}
        }
      ],
      "source": [
        "!pip install pycuda #CUDA, use an a100 instance\n",
        "!sudo apt install -y libvulkan-dev #Vulkan\n",
        "!pip install apache-tvm #Tensor Virtual Machine\n",
        "!pip install -I mlc_ai_nightly -f https://mlc.ai/wheels #MLC-LLM"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 1 - Build a Simple Language Model in Python!"
      ],
      "metadata": {
        "id": "TQ6uxfJUKDWc"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "- This Model aims to predict the next character in the sequence based on the current character.\n",
        "- The model is trained on a dataset of individual characters ('a', 'b', 'c', 'd', 'e')\n",
        "- This is a one-hot encoded character prediction model.\n",
        "- ![Alt Text](https://miro.medium.com/v2/resize:fit:714/0*TsV7C_p9Yhkeqthg.png)\n",
        "- ![Alt Text](https://raw.githubusercontent.com/geekquad/deep-learning-v2-pytorch/master/recurrent-neural-networks/char-rnn/assets/charseq.jpeg)\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "cLNs8HOJ6UX-"
      }
    },
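    {
      "cell_type": "markdown",
      "source": [
        "The bullets above can be written out as equations. This is a sketch rather than a definitive formulation: it assumes $W$ is the `vocab_size` x `vocab_size` weight matrix, $x$ and $y$ are the indices of the current and next character, and the symbols $z$, $p$, $L$, and $\\eta$ (learning rate) are names introduced here for illustration.\n",
        "\n",
        "$$z = \\mathrm{onehot}(x)^\\top W = W_{x,:}$$\n",
        "\n",
        "$$p_k = \\frac{e^{z_k}}{\\sum_j e^{z_j}}, \\qquad L = -\\log p_y$$\n",
        "\n",
        "$$\\frac{\\partial L}{\\partial z_k} = p_k - \\mathbf{1}[k = y], \\qquad W_{x,:} \\leftarrow W_{x,:} - \\eta \\, (p - \\mathrm{onehot}(y))$$\n",
        "\n",
        "Multiplying by a one-hot vector simply selects row $x$ of $W$, so only that row receives a gradient. For example, if all five logits are equal, then every $p_k = 0.2$ and $L = \\log 5 \\approx 1.61$."
      ],
      "metadata": {}
    },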
    {
      "cell_type": "code",
      "source": [
        "# Import the NumPy library for numerical computations\n",
        "import numpy as np\n",
        "\n",
        "# ---- Data and Parameters ----\n",
        "\n",
        "# Initialize the data as a string \"abcde\"\n",
        "data = \"abcde\"\n",
        "# Create a dictionary to map each unique character in the data to an index\n",
        "vocab = {c: i for i, c in enumerate(set(data))}\n",
        "# Create an inverse dictionary to map each index back to its corresponding character\n",
        "inv_vocab = {i: c for c, i in vocab.items()}\n",
        "# Initialize the weight matrix with random values; its dimensions are based on the vocabulary size\n",
        "w = np.random.rand(len(vocab), len(vocab))\n",
        "# Set the learning rate for the model\n",
        "lr = 0.1\n",
        "\n",
        "# ---- Data Preprocessing ----\n",
        "\n",
        "# Convert each character in the data to its corresponding numerical index\n",
        "data_idx = [vocab[c] for c in data]\n",
        "\n",
        "# ---- Training Loop ----\n",
        "\n",
        "# Loop through the data (ignoring the last character)\n",
        "for i in range(len(data_idx) - 1):\n",
        "    # Extract the current and next characters' indices\n",
        "    x, y = data_idx[i], data_idx[i + 1]\n",
        "    # Generate a one-hot encoded vector for the current character\n",
        "    x_onehot = np.eye(len(vocab))[x]\n",
        "\n",
        "    # ---- Forward Pass ----\n",
        "\n",
        "    # Compute the dot product of the one-hot vector and the weight matrix, then apply the exponential function\n",
        "    pred = np.exp(np.dot(x_onehot, w))\n",
        "    # Normalize the resulting vector to form a probability distribution\n",
        "    pred /= np.sum(pred)\n",
        "\n",
        "    # ---- Loss and Gradient Calculation ----\n",
        "\n",
        "    # Compute the negative log-likelihood loss\n",
        "    loss = -np.log(pred[y])\n",
        "    # Compute the gradient based on the difference between the predicted and actual distributions\n",
        "    grad = pred - np.eye(len(vocab))[y]\n",
        "\n",
        "    # ---- Weight Update ----\n",
        "\n",
        "    # Update the weight matrix using stochastic gradient descent\n",
        "    w[:, x] -= lr * grad\n",
        "\n",
        "# ---- Inference ----\n",
        "\n",
        "# Set the input character for prediction\n",
        "input_char = 'a'\n",
        "# Convert the input character to its corresponding numerical index\n",
        "input_idx = vocab[input_char]\n",
        "# Generate a one-hot encoded vector for the input character\n",
        "input_onehot = np.eye(len(vocab))[input_idx]\n",
        "# Perform the forward pass to get the output probabilities\n",
        "output_prob = np.exp(np.dot(input_onehot, w))\n",
        "# Normalize the output probabilities\n",
        "output_prob /= np.sum(output_prob)\n",
        "# Find the index of the character with the highest predicted probability\n",
        "output_idx = np.argmax(output_prob)\n",
        "# Use the inverse vocabulary to map the index back to a character\n",
        "output_char = inv_vocab[output_idx]\n",
        "\n",
        "# Print the predicted next character given the input\n",
        "print(f\"Given '{input_char}', next predicted character is '{output_char}'\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "gnnVK1WVKWJ-",
        "outputId": "59683f5e-e7fe-4794-8f17-060d561c18c4"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Given 'a', next predicted character is 'c'\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Measure Inference Speed on Intel CPU\n",
        "\n",
        "- CPU cycle\n",
        "- ![Alt Text](https://www.computerhope.com/jargon/m/machine-cycle.png)\n",
        "- In a CPU, the memory hierarchy includes multiple levels of cache to store data temporarily for time, space, & power efficiency.\n",
        "- Full Data Flow below\n",
        "- ![Alt Text](https://anilmaurya.files.wordpress.com/2016/02/cpu-block-diagram.gif)\n"
      ],
      "metadata": {
        "id": "CvrTMzvoKbR2"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import the 'time' library for timing operations\n",
        "import time\n",
        "# Import the 'numpy' library for numerical operations\n",
        "import numpy as np\n",
        "\n",
        "# Initialize an empty list to store the time taken for each inference operation\n",
        "times = []\n",
        "\n",
        "# Loop to run the inference process 1000 times\n",
        "for _ in range(1000):\n",
        "    # Record the start time before running inference\n",
        "    start = time.time()\n",
        "\n",
        "    # Create a one-hot encoded input array for inference (here, the first element is 1 and the rest are 0)\n",
        "    input_onehot = np.eye(5)[0]\n",
        "\n",
        "    # Perform the forward pass: dot product of the one-hot input with a random weight matrix, followed by exponentiation\n",
        "    output_prob = np.exp(np.dot(input_onehot, np.random.rand(5, 5)))\n",
        "\n",
        "    # Normalize the output to get a probability distribution\n",
        "    output_prob /= np.sum(output_prob)\n",
        "\n",
        "    # Record the time taken for this inference operation\n",
        "    elapsed_time = time.time() - start\n",
        "\n",
        "    # Append the elapsed time to the 'times' list\n",
        "    times.append(elapsed_time)\n",
        "\n",
        "# Calculate the average time taken for inference across all runs, and convert it to milliseconds\n",
        "average_time = np.mean(times) * 1000\n",
        "\n",
        "# Print the average inference time, rounded to 2 decimal places\n",
        "print(f\"Average Inference Time: {average_time:.2f} ms\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Dcus5YpRKfgM",
        "outputId": "8ce858d9-7976-4b15-8e20-982631222120"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Average Inference Time: 0.07 ms\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 2 - Accelerate on NVIDIA GPUs using Compiler #1: CUDA\n",
        "\n",
        "\n",
        "- ![Alt Text](https://data.embeddedcomputing.com/uploads/articles/wp/1462/54176dc2ec16c-Vivante-Sept-Fig1.jpg)\n",
        "- ![Alt Text](https://pbs.twimg.com/media/EbxlFreUwAAZ2sZ?format=jpg&name=900x900)\n",
        "- ![Alt Text](https://www.researchgate.net/profile/Joshua-Payne-2/publication/265291072/figure/fig1/AS:650829417680896@1532181233055/Performance-comparison-of-GPUs-vs-CPUs.png)\n",
        "- ![Alt Text](http://2.bp.blogspot.com/-UHviPBWWhR8/UPWSIVXHmLI/AAAAAAAAAsU/3gi4jAAtSIU/s1600/compilation.png)\n",
        "- CUDA tutorial https://web.engr.oregonstate.edu/~mjb/cs575/Handouts/cuda.1pp.pdf\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "x03DthaJKhNl"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Import the CUDA driver interface from the PyCUDA library\n",
        "import pycuda.driver as cuda\n",
        "# Initialize the PyCUDA library\n",
        "import pycuda.autoinit\n",
        "# Import the SourceModule for compiling CUDA source code\n",
        "from pycuda.compiler import SourceModule\n",
        "# Import NumPy for numerical operations\n",
        "import numpy as np\n",
        "# Import time for timing the code\n",
        "import time\n",
        "\n",
        "# Define a vocabulary list\n",
        "vocab = ['a', 'b', 'c', 'd', 'e']\n",
        "\n",
        "# Define the CUDA kernel function for inference as a string\n",
        "mod = SourceModule(\"\"\"\n",
        "  __global__ void inference(float *x, float *w, float *y, int vocab_size)\n",
        "  {\n",
        "    const int i = threadIdx.x;  // Get the thread index\n",
        "    float sum = 0;  // Initialize sum variable\n",
        "\n",
        "    // Loop to calculate the sum of element-wise products of 'x' and 'w'\n",
        "    for(int j = 0; j < vocab_size; j++)\n",
        "    {\n",
        "        sum += exp(w[j * vocab_size + i] * x[j]);  // Exponential function applied to the element-wise product\n",
        "    }\n",
        "\n",
        "    // Store the sum in the output array 'y'\n",
        "    y[i] = sum;\n",
        "  }\n",
        "\"\"\")\n",
        "\n",
        "# Define the size of the vocabulary\n",
        "vocab_size = 5\n",
        "# Generate random weights and cast them to float32\n",
        "w = np.random.rand(vocab_size, vocab_size).astype(np.float32)\n",
        "# Create a one-hot vector representation for the first element 'a' in the vocabulary\n",
        "x_onehot = np.eye(vocab_size, dtype=np.float32)[0]  # One-hot vector for 'a'\n",
        "\n",
        "# Allocate GPU memory for weight, input, and output arrays\n",
        "w_gpu = cuda.mem_alloc(w.nbytes)\n",
        "x_gpu = cuda.mem_alloc(x_onehot.nbytes)\n",
        "y_gpu = cuda.mem_alloc(vocab_size * 4)  # 4 bytes for each float32 element\n",
        "\n",
        "# Copy data from host to device (CPU to GPU)\n",
        "cuda.memcpy_htod(w_gpu, w)\n",
        "cuda.memcpy_htod(x_gpu, x_onehot)\n"
      ],
      "metadata": {
        "id": "Vy-KsqfZK11p"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Measure Inference Speed on Nvidia GPU"
      ],
      "metadata": {
        "id": "jhsPJnjmLBWr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Retrieve the compiled CUDA kernel function named \"inference\"\n",
        "func = mod.get_function(\"inference\")\n",
        "# Record the starting time for timing the execution\n",
        "start_time = time.time()\n",
        "\n",
        "# Execute the CUDA kernel function 1000 times\n",
        "for _ in range(1000):\n",
        "    # Run the CUDA function with the given parameters and thread configuration\n",
        "    func(x_gpu, w_gpu, y_gpu, np.int32(vocab_size), block=(vocab_size, 1, 1))\n",
        "\n",
        "# Synchronize the CUDA context to ensure all operations are complete\n",
        "cuda.Context.synchronize()\n",
        "# Record the ending time to calculate the elapsed time\n",
        "end_time = time.time()\n",
        "\n",
        "# Create an empty NumPy array with the same shape as x_onehot to store results\n",
        "y = np.empty_like(x_onehot)\n",
        "# Copy the result from GPU to CPU (device to host)\n",
        "cuda.memcpy_dtoh(y, y_gpu)\n",
        "# Apply the exponential function to the result to undo the previous log operation\n",
        "y = np.exp(y)\n",
        "# Normalize the probabilities so they sum to 1\n",
        "y /= np.sum(y)\n",
        "\n",
        "# Find the index of the maximum value in the array 'y'\n",
        "predicted_index = np.argmax(y)\n",
        "\n",
        "# Print the calculated probabilities\n",
        "print(f\"Predicted probabilities: {y}\")\n",
        "# Print the most probable next character based on the given input 'a'\n",
        "print(f\"Given 'a', the next predicted character is: '{vocab[predicted_index]}'\")\n",
        "# Print the average time taken for inference\n",
        "print(f\"Average Inference Time: {(end_time - start_time) / 1000 * 1000:.2f} ms\")\n"
      ],
      "metadata": {
        "id": "0hmDyTp3LIcW",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "828a5300-9037-4f1c-9d73-bb27c63e4ad6"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Predicted probabilities: [0.14863102 0.15203634 0.15777873 0.310623   0.23093095]\n",
            "Given 'a', the next predicted character is: 'd'\n",
            "Average Inference Time: 0.01 ms\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/local/lib/python3.10/dist-packages/google/colab/_variable_inspector.py:27: UserWarning: module in out-of-thread context could not be cleaned up\n",
            "  globals().clear()\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 3 - Accelerate on iOS GPUs using Compiler #2: Metal\n",
        "\n",
        "\n",
        "- There are many GPU providers, not just Nvidia\n",
        "- ![Alt Text](https://i.ytimg.com/vi/6GOzFB7bAqg/sddefault.jpg)\n",
        "- Apple has it's A Series Chips\n",
        "- ![Alt Text](https://static.wikia.nocookie.net/ipod/images/8/8d/Apple_A15_simplified_schematic_parts.jpg/revision/latest?cb=20220911035113)\n",
        "- The Metal Compiler lets us leverage it\n",
        "- ![Alt Text](https://devimages-cdn.apple.com/wwdc-services/images/124/6556/6556_wide_250x141_2x.jpg)\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "CZbnsy5hLqXo"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Time to leave Colab...sorry, not sorry Nvidia.\n",
        "\n",
        "1. Open Xcode\n",
        "2. Create a new command-line project\n",
        "3. Paste the below inference function into it\n",
        "4. Create a .metal file\n",
        "7. Replace the .metal reference directory with the second cell block\n",
        "8. Build and run the project!"
      ],
      "metadata": {
        "id": "0JGYKmHtHyzk"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Language model in Objective-C"
      ],
      "metadata": {
        "id": "mIkP-gm2hJ63"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "```objective-c\n",
        "\n",
        "#import <Foundation/Foundation.h>\n",
        "#import <Metal/Metal.h>\n",
        "\n",
        "#import <math.h>\n",
        "\n",
        "@interface SimpleLanguageModel : NSObject\n",
        "\n",
        "@property float *w;\n",
        "@property float lr;\n",
        "@property NSDictionary *vocab;\n",
        "@property NSDictionary *inv_vocab;\n",
        "@property NSUInteger vocabSize;\n",
        "\n",
        "- (instancetype)initWithData:(NSString *)data learningRate:(float)lr;\n",
        "- (void)train;\n",
        "- (NSString *)predictNextCharacter:(NSString *)inputChar;\n",
        "\n",
        "@end\n",
        "\n",
        "@implementation SimpleLanguageModel\n",
        "\n",
        "- (instancetype)initWithData:(NSString *)data learningRate:(float)lr {\n",
        "    self = [super init];\n",
        "    if (self) {\n",
        "        self.lr = lr;\n",
        "        self.vocab = [self buildVocab:data];\n",
        "        self.inv_vocab = [self buildInvVocab:self.vocab];\n",
        "        self.vocabSize = [self.vocab count];\n",
        "        self.w = malloc(self.vocabSize * self.vocabSize * sizeof(float));\n",
        "        for (int i = 0; i < self.vocabSize * self.vocabSize; i++) {\n",
        "            self.w[i] = ((float)rand() / RAND_MAX);\n",
        "        }\n",
        "    }\n",
        "    return self;\n",
        "}\n",
        "\n",
        "- (NSDictionary *)buildVocab:(NSString *)data {\n",
        "    NSMutableDictionary *vocab = [NSMutableDictionary dictionary];\n",
        "    for (int i = 0; i < [data length]; i++) {\n",
        "        NSString *charStr = [NSString stringWithFormat:@\"%C\", [data characterAtIndex:i]];\n",
        "        [vocab setObject:[NSNumber numberWithInt:i] forKey:charStr];\n",
        "    }\n",
        "    return [vocab copy];\n",
        "}\n",
        "\n",
        "- (NSDictionary *)buildInvVocab:(NSDictionary *)vocab {\n",
        "    NSMutableDictionary *inv_vocab = [NSMutableDictionary dictionary];\n",
        "    for (NSString *key in vocab) {\n",
        "        inv_vocab[vocab[key]] = key;\n",
        "    }\n",
        "    return [inv_vocab copy];\n",
        "}\n",
        "\n",
        "- (void)train {\n",
        "    NSArray *data = @[@0, @1, @2, @3, @4];  // Assuming the vocab indices are from 0 to 4\n",
        "    \n",
        "    for (NSNumber *x in data) {\n",
        "        NSNumber *y = [data objectAtIndex:([data indexOfObject:x] + 1) % [data count]];\n",
        "        \n",
        "        float pred[self.vocabSize];\n",
        "        for (int i = 0; i < self.vocabSize; i++) {\n",
        "            pred[i] = exp(self.w[i * self.vocabSize + [x intValue]]);\n",
        "        }\n",
        "        float sumPred = 0;\n",
        "        for (int i = 0; i < self.vocabSize; i++) {\n",
        "            sumPred += pred[i];\n",
        "        }\n",
        "        for (int i = 0; i < self.vocabSize; i++) {\n",
        "            pred[i] /= sumPred;\n",
        "        }\n",
        "        \n",
        "        float loss = -log(pred[[y intValue]]);\n",
        "        NSLog(@\"Loss: %f\", loss);\n",
        "        \n",
        "        float grad[self.vocabSize];\n",
        "        for (int i = 0; i < self.vocabSize; i++) {\n",
        "            grad[i] = pred[i] - (i == [y intValue] ? 1 : 0);\n",
        "        }\n",
        "        \n",
        "        for (int i = 0; i < self.vocabSize; i++) {\n",
        "            self.w[i * self.vocabSize + [x intValue]] -= self.lr * grad[i];\n",
        "        }\n",
        "    }\n",
        "}\n",
        "\n",
        "- (NSString *)predictNextCharacter:(NSString *)inputChar {\n",
        "    int inputIdx = [[self.vocab objectForKey:inputChar] intValue];\n",
        "    \n",
        "    float output_prob[self.vocabSize];\n",
        "    for (int i = 0; i < self.vocabSize; i++) {\n",
        "        output_prob[i] = exp(self.w[i * self.vocabSize + inputIdx]);\n",
        "    }\n",
        "    \n",
        "    float sumProb = 0;\n",
        "    for (int i = 0; i < self.vocabSize; i++) {\n",
        "        sumProb += output_prob[i];\n",
        "    }\n",
        "    for (int i = 0; i < self.vocabSize; i++) {\n",
        "        output_prob[i] /= sumProb;\n",
        "    }\n",
        "    \n",
        "    int outputIdx = 0;\n",
        "    float maxProb = 0;\n",
        "    for (int i = 0; i < self.vocabSize; i++) {\n",
        "        if (output_prob[i] > maxProb) {\n",
        "            maxProb = output_prob[i];\n",
        "            outputIdx = i;\n",
        "        }\n",
        "    }\n",
        "    \n",
        "    return [self.inv_vocab objectForKey:[NSNumber numberWithInt:outputIdx]];\n",
        "}\n",
        "\n",
        "@end\n",
        "\n",
        "\n",
        "int main(int argc, const char * argv[]) {\n",
        "    @autoreleasepool {\n",
        "        \n",
        "        id<MTLDevice> device = MTLCreateSystemDefaultDevice();\n",
        "        NSError *error = nil;\n",
        "        \n",
        "        NSString *metalSrc = [NSString stringWithContentsOfFile:@\"/Users/sirajraval/Desktop/metal_test/metal_test/fun.metal\" encoding:NSUTF8StringEncoding error:&error];\n",
        "        id<MTLLibrary> library = [device newLibraryWithSource:metalSrc options:nil error:&error];\n",
        "        \n",
        "        id<MTLFunction> function = [library newFunctionWithName:@\"predict_next_character\"];\n",
        "        id<MTLComputePipelineState> pipelineState = [device newComputePipelineStateWithFunction:function error:&error];\n",
        "        \n",
        "        id<MTLCommandQueue> commandQueue = [device newCommandQueue];\n",
        "        id<MTLCommandBuffer> commandBuffer = [commandQueue commandBuffer];\n",
        "        id<MTLComputeCommandEncoder> commandEncoder = [commandBuffer computeCommandEncoder];\n",
        "        \n",
        "        [commandEncoder setComputePipelineState:pipelineState];\n",
        "        \n",
        "        SimpleLanguageModel *model = [[SimpleLanguageModel alloc] initWithData:@\"abcde\" learningRate:0.1];\n",
        "        \n",
        "        // Convert your weights to an MTLBuffer (in this example, I'm assuming the weights are a 5x5 matrix)\n",
        "        id<MTLBuffer> weightBuffer = [device newBufferWithBytes:model.w length:25*sizeof(float) options:MTLResourceStorageModeShared];\n",
        "        \n",
        "        float input_onehot[] = {1.0, 0.0, 0.0, 0.0, 0.0};  // Assume 'a' is encoded as this one-hot vector\n",
        "        id<MTLBuffer> inputBuffer = [device newBufferWithBytes:input_onehot length:5*sizeof(float) options:MTLResourceStorageModeShared];\n",
        "        \n",
        "        float output_prob[5] = {0};\n",
        "        id<MTLBuffer> outputBuffer = [device newBufferWithBytes:output_prob length:5*sizeof(float) options:MTLResourceStorageModeShared];\n",
        "        \n",
        "        [commandEncoder setBuffer:weightBuffer offset:0 atIndex:0];\n",
        "        [commandEncoder setBuffer:inputBuffer offset:0 atIndex:1];\n",
        "        [commandEncoder setBuffer:outputBuffer offset:0 atIndex:2];\n",
        "        \n",
        "        MTLSize gridSize = MTLSizeMake(5, 1, 1);\n",
        "        MTLSize threadGroupSize = MTLSizeMake(1, 1, 1);\n",
        "        \n",
        "        [commandEncoder dispatchThreadgroups:gridSize threadsPerThreadgroup:threadGroupSize];\n",
        "        \n",
        "        [commandEncoder endEncoding];\n",
        "        [commandBuffer commit];\n",
        "        \n",
        "\n",
        "\n",
        "               \n",
        "        \n",
        "        \n",
        "        [commandBuffer waitUntilCompleted];\n",
        "        \n",
        "        memcpy(output_prob, [outputBuffer contents], sizeof(output_prob));\n",
        "        \n",
        "        // Sum up all the probabilities to normalize them\n",
        "        double sum = 0.0;\n",
        "        for (int i = 0; i < model.vocabSize; i++) {\n",
        "            sum += output_prob[i];\n",
        "        }\n",
        "\n",
        "        // Normalize each output probability\n",
        "        for (int i = 0; i < model.vocabSize; i++) {\n",
        "            output_prob[i] /= sum;\n",
        "        }\n",
        "\n",
        "        // Find the index of the most likely next character\n",
        "        int maxIndex = 0;\n",
        "        double maxProb = 0.0;\n",
        "        for (int i = 0; i < model.vocabSize; i++) {\n",
        "            if (output_prob[i] > maxProb) {\n",
        "                maxProb = output_prob[i];\n",
        "                maxIndex = i;\n",
        "            }\n",
        "        }\n",
        "\n",
        "        // Convert the index back to a character\n",
        "        NSString *nextChar = [model.inv_vocab objectForKey:[NSNumber numberWithInt:maxIndex]];\n",
        "\n",
        "        NSLog(@\"Given '%s', the next predicted character is '%@'\", \"a\", nextChar);\n",
        "        \n",
        "\n",
        "        NSDate *startTime = [NSDate date];  // Capture the start time\n",
        "        NSString *nexty = [model predictNextCharacter:@\"a\"];\n",
        "        NSDate *endTime = [NSDate date];  // Capture the end time\n",
        "        NSTimeInterval timeInterval = [endTime timeIntervalSinceDate:startTime];\n",
        "\n",
        "        NSLog(@\"Inference took %f milliseconds\", timeInterval*1000);\n",
        "\n",
        "\n",
        "    }\n",
        "    return 0;\n",
        "}\n",
        "\n",
        "```"
      ],
      "metadata": {
        "id": "1XhHomwPg8RG"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Metal Code for Inference Speed Up on Apple GPUs"
      ],
      "metadata": {
        "id": "ZdiiaJmshX4G"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "```objective-c\n",
        "#include <metal_stdlib>\n",
        "using namespace metal;\n",
        "\n",
        "kernel void predict_next_character(\n",
        "    constant float *weights [[buffer(0)]],\n",
        "    device float *input_onehot [[buffer(1)]],\n",
        "    device float *output_prob [[buffer(2)]],\n",
        "    uint id [[thread_position_in_grid]]\n",
        ") {\n",
        "    float prob = 0.0;\n",
        "    for (uint i = 0; i < 5; ++i) {\n",
        "        prob += weights[id * 5 + i] * input_onehot[i];\n",
        "    }\n",
        "    output_prob[id] = exp(prob);\n",
        "}\n",
        "\n",
        "```"
      ],
      "metadata": {
        "id": "L_MFtbTuGkk1"
      }
    },
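    {
      "cell_type": "markdown",
      "source": [
        "As a CPU-side reference, the work of the Metal kernel above can be sketched in plain Python. This is only an illustration: the weight values, `kernel_thread`, and the final normalization step are assumptions made for the example, not part of the original app code.\n",
        "\n",
        "```python\n",
        "import math\n",
        "\n",
        "VOCAB_SIZE = 5  # matches the hard-coded 5 in the kernel\n",
        "\n",
        "# Hypothetical weights in a flat, row-major list, indexed like weights[id * 5 + i]\n",
        "weights = [0.1 * ((row + col) % VOCAB_SIZE)\n",
        "           for row in range(VOCAB_SIZE) for col in range(VOCAB_SIZE)]\n",
        "\n",
        "# One-hot input selecting the first character of the vocabulary\n",
        "input_onehot = [1.0, 0.0, 0.0, 0.0, 0.0]\n",
        "\n",
        "def kernel_thread(thread_id):\n",
        "    # Work done by a single GPU thread: one dot product followed by exp\n",
        "    prob = 0.0\n",
        "    for i in range(VOCAB_SIZE):\n",
        "        prob += weights[thread_id * VOCAB_SIZE + i] * input_onehot[i]\n",
        "    return math.exp(prob)\n",
        "\n",
        "# The GPU runs all thread ids in parallel; here they run one after another\n",
        "output_prob = [kernel_thread(t) for t in range(VOCAB_SIZE)]\n",
        "\n",
        "# Assumed host-side step: normalize, then pick the most likely index\n",
        "total = sum(output_prob)\n",
        "probs = [p / total for p in output_prob]\n",
        "next_index = max(range(VOCAB_SIZE), key=lambda t: probs[t])\n",
        "```\n",
        "\n",
        "The argmax is the same with or without the normalization, since dividing by a positive sum does not change the ordering."
      ],
      "metadata": {}
    },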
    {
      "cell_type": "markdown",
      "source": [
        "## Step 4 - Accelerate on Android GPUs using Compiler #3: Vulkan\n",
        "\n",
        "- Most Phones use Android\n",
        "- ![Alt Text](https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjV3ABZMrT-oJ3vcUTwJr6xp4FKwyD0Gg2f6ZayIDpP9HyXDyWtq0nPZVVF0zye4gAr-OU-2ehNdJTpJ7b9no2Ebj7TMcHsoz72LKewuC5Y4IEbH_ghFFuFhtYn_GohJQcKbDbuubXEESwwBPaLxzk4CT7GnjfcE5hwTZxav6ekTvMck09Rc8ccRASp/s2092/os-market-share.png)\n",
        "\n",
        "- Lots of Different GPU vendors in the Android Ecosystem\n",
        "\n",
        "- ![Alt Text](https://www.techspot.com/images2/news/bigimage/2020/12/2020-12-26-image-2.png)\n",
        "\n",
        "Vulkan is the best API, it's a successor to OpenGL\n",
        "\n",
        "- ![Alt Text](https://i.ytimg.com/vi/rvCD9FaTKCA/maxresdefault.jpg)\n",
        "- ![Alt Text](https://architosh.com/wp-content/uploads/2015/03/opengl-v-vulkan.jpg)\n",
        "\n",
        "- ![Alt Text](https://registry.khronos.org/vulkan/site/guide/latest/_images/platforms_overview.png)\n",
        "\n",
        "- Let's leverage the video memory, the vRAM\n",
        "\n",
        "- ![Alt Text](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e6/Division_of_labor_cpu_and_gpu.svg/500px-Division_of_labor_cpu_and_gpu.svg.png)\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "4EgFaPyrNV7y"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "```c\n",
        "%%writefile vulkan.cpp\n",
        "\n",
        "\n",
        "#include <iostream>\n",
        "#include <fstream>\n",
        "#include <unordered_map>\n",
        "#include <vector>\n",
        "#include <cmath>\n",
        "#include <numeric>\n",
        "#include <algorithm>\n",
        "#include <chrono>  \n",
        "#include <vulkan/vulkan.h>\n",
        "\n",
        "\n",
        "\n",
        "VkInstance instance;\n",
        "VkPhysicalDevice physicalDevice;\n",
        "VkDevice device;\n",
        "VkPipelineLayout pipelineLayout;\n",
        "VkCommandPool commandPool;\n",
        "VkCommandBuffer commandBuffer;\n",
        "VkPipeline computePipeline;\n",
        "\n",
        "\n",
        "\n",
        "using namespace std;\n",
        "\n",
        "void createInstance() {\n",
        "    VkApplicationInfo appInfo{};\n",
        "    appInfo.sType = VK_STRUCTURE_TYPE_APPLICATION_INFO;\n",
        "    appInfo.pApplicationName = \"Language Model\";\n",
        "    appInfo.applicationVersion = VK_MAKE_VERSION(1, 0, 0);\n",
        "    appInfo.pEngineName = \"No Engine\";\n",
        "    appInfo.engineVersion = VK_MAKE_VERSION(1, 0, 0);\n",
        "    appInfo.apiVersion = VK_API_VERSION_1_0;\n",
        "\n",
        "    VkInstanceCreateInfo createInfo{};\n",
        "    createInfo.sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;\n",
        "    createInfo.pApplicationInfo = &appInfo;\n",
        "\n",
        "    if (vkCreateInstance(&createInfo, nullptr, &instance) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to create Vulkan instance!\");\n",
        "    }\n",
        "\n",
        "}\n",
        "\n",
        "void runComputationOnGPU() {\n",
        "    // Start recording commands\n",
        "    VkCommandBufferBeginInfo beginInfo{};\n",
        "    beginInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;\n",
        "\n",
        "    if (vkBeginCommandBuffer(commandBuffer, &beginInfo) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to begin recording command buffer!\");\n",
        "    }\n",
        "\n",
        "    // Add your Vulkan compute dispatch command here\n",
        "    vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, computePipeline);\n",
        "    vkCmdDispatch(commandBuffer, 1, 1, 1);\n",
        "\n",
        "    // End recording\n",
        "    if (vkEndCommandBuffer(commandBuffer) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to record command buffer!\");\n",
        "    }\n",
        "\n",
        "    // Submit the command buffer to the queue\n",
        "    VkSubmitInfo submitInfo{};\n",
        "    submitInfo.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;\n",
        "    submitInfo.commandBufferCount = 1;\n",
        "    submitInfo.pCommandBuffers = &commandBuffer;\n",
        "\n",
        "    VkQueue computeQueue;  // Should be initialized properly\n",
        "    if (vkQueueSubmit(computeQueue, 1, &submitInfo, VK_NULL_HANDLE) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to submit command buffer!\");\n",
        "    }\n",
        "\n",
        "    vkQueueWaitIdle(computeQueue);  // Ensure the computation is finished\n",
        "}\n",
        "\n",
        "\n",
        "void pickPhysicalDevice() {\n",
        "    uint32_t deviceCount = 0;\n",
        "    vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);\n",
        "\n",
        "    if (deviceCount == 0) {\n",
        "        throw std::runtime_error(\"Failed to find GPUs with Vulkan support!\");\n",
        "    }\n",
        "\n",
        "    std::vector<VkPhysicalDevice> devices(deviceCount);\n",
        "    vkEnumeratePhysicalDevices(instance, &deviceCount, devices.data());\n",
        "\n",
        "    for (const auto& dev : devices) {\n",
        "        if (true /* Here we can put conditions for picking the best device */) {\n",
        "            physicalDevice = dev;\n",
        "            break;\n",
        "        }\n",
        "    }\n",
        "\n",
        "    if (physicalDevice == VK_NULL_HANDLE) {\n",
        "        throw std::runtime_error(\"Failed to find a suitable GPU!\");\n",
        "    }\n",
        "}\n",
        "\n",
        "void createLogicalDevice() {\n",
        "    VkDeviceQueueCreateInfo queueCreateInfo{};\n",
        "    queueCreateInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;\n",
        "    queueCreateInfo.queueFamilyIndex = 0; // Should be determined properly\n",
        "    queueCreateInfo.queueCount = 1;\n",
        "\n",
        "    float queuePriority = 1.0f;\n",
        "    queueCreateInfo.pQueuePriorities = &queuePriority;\n",
        "\n",
        "    VkDeviceCreateInfo deviceInfo{};\n",
        "    deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;\n",
        "    deviceInfo.pQueueCreateInfos = &queueCreateInfo;\n",
        "    deviceInfo.queueCreateInfoCount = 1;\n",
        "\n",
        "    VkPhysicalDeviceFeatures deviceFeatures{};\n",
        "    deviceInfo.pEnabledFeatures = &deviceFeatures;\n",
        "\n",
        "    if (vkCreateDevice(physicalDevice, &deviceInfo, nullptr, &device) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to create logical device!\");\n",
        "    }\n",
        "}\n",
        "\n",
        "void createCommandPool() {\n",
        "    VkCommandPoolCreateInfo poolInfo{};\n",
        "    poolInfo.sType = VK_STRUCTURE_TYPE_COMMAND_POOL_CREATE_INFO;\n",
        "    poolInfo.queueFamilyIndex = 0;  // Should be determined properly\n",
        "\n",
        "    if (vkCreateCommandPool(device, &poolInfo, nullptr, &commandPool) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to create command pool!\");\n",
        "    }\n",
        "}\n",
        "\n",
        "\n",
        "void createCommandBuffer() {\n",
        "    VkCommandBufferAllocateInfo allocInfo{};\n",
        "    allocInfo.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_ALLOCATE_INFO;\n",
        "    allocInfo.commandPool = commandPool;\n",
        "    allocInfo.level = VK_COMMAND_BUFFER_LEVEL_PRIMARY;\n",
        "    allocInfo.commandBufferCount = 1;\n",
        "\n",
        "    if (vkAllocateCommandBuffers(device, &allocInfo, &commandBuffer) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to allocate command buffer!\");\n",
        "    }\n",
        "}\n",
        "\n",
        "\n",
        "VkBuffer createBuffer(VkDevice device, VkPhysicalDevice physicalDevice, VkDeviceSize size, VkBufferUsageFlags usage) {\n",
        "    VkBufferCreateInfo bufferInfo{};\n",
        "    bufferInfo.sType = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO;\n",
        "    bufferInfo.size = size;\n",
        "    bufferInfo.usage = usage;\n",
        "    bufferInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;\n",
        "\n",
        "    VkBuffer buffer;\n",
        "    if (vkCreateBuffer(device, &bufferInfo, nullptr, &buffer) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to create buffer.\");\n",
        "    }\n",
        "\n",
        "    return buffer;\n",
        "}\n",
        "\n",
        "void copyDataToBuffer(VkDevice device, VkBuffer buffer, void* data, VkDeviceSize size) {\n",
        "    // Code to map buffer memory and copy data to buffer\n",
        "    // Implementation can vary based on how you've set up memory\n",
        "    // Typically you'd use vkMapMemory, memcpy, and vkUnmapMemory\n",
        "}\n",
        "\n",
        "void createAndFillBuffer(VkDevice device, VkPhysicalDevice physicalDevice, void* data, VkDeviceSize size, VkBufferUsageFlags usage, VkBuffer& buffer) {\n",
        "    buffer = createBuffer(device, physicalDevice, size, usage);\n",
        "    copyDataToBuffer(device, buffer, data, size);\n",
        "}\n",
        "\n",
        "VkShaderModule createShaderModule(VkDevice device, const std::vector<char>& code) {\n",
        "    VkShaderModuleCreateInfo createInfo{};\n",
        "    createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;\n",
        "    createInfo.codeSize = code.size();\n",
        "    createInfo.pCode = reinterpret_cast<const uint32_t*>(code.data());\n",
        "\n",
        "    VkShaderModule shaderModule;\n",
        "    if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {\n",
        "        throw std::runtime_error(\"Failed to create shader module\");\n",
        "    }\n",
        "\n",
        "    return shaderModule;\n",
        "}\n",
        "\n",
        "void createShaderModuler()  {\n",
        "\n",
        "\n",
        "    VkPipelineLayout pipelineLayout;\n",
        "VkPipelineLayoutCreateInfo pipelineLayoutInfo{};\n",
        "pipelineLayoutInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;\n",
        "pipelineLayoutInfo.setLayoutCount = 0; // Optional\n",
        "pipelineLayoutInfo.pSetLayouts = nullptr; // Optional\n",
        "pipelineLayoutInfo.pushConstantRangeCount = 0; // Optional\n",
        "pipelineLayoutInfo.pPushConstantRanges = nullptr; // Optional\n",
        "\n",
        "if (vkCreatePipelineLayout(device, &pipelineLayoutInfo, nullptr, &pipelineLayout) != VK_SUCCESS) {\n",
        "    throw std::runtime_error(\"Failed to create pipeline layout!\");\n",
        "}\n",
        "\n",
        "}\n",
        "\n",
        "std::vector<char> readShaderFile(const std::string& filename) {\n",
        "    std::ifstream file(filename, std::ios::ate | std::ios::binary);\n",
        "\n",
        "    if (!file.is_open()) {\n",
        "        throw std::runtime_error(\"Failed to open file!\");\n",
        "    }\n",
        "\n",
        "    size_t fileSize = (size_t) file.tellg();\n",
        "    std::vector<char> buffer(fileSize);\n",
        "\n",
        "    file.seekg(0);\n",
        "    file.read(buffer.data(), fileSize);\n",
        "\n",
        "    file.close();\n",
        "\n",
        "    return buffer;\n",
        "}\n",
        "\n",
        "\n",
        "void createPipeline()  {\n",
        "\n",
        "VkPipeline computePipeline;\n",
        "VkComputePipelineCreateInfo pipelineInfo{};\n",
        "pipelineInfo.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;\n",
        "pipelineInfo.stage.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;\n",
        "pipelineInfo.stage.stage = VK_SHADER_STAGE_COMPUTE_BIT;\n",
        "\n",
        "\n",
        "std::vector<char> shaderCode = readShaderFile(\"/comp.spv\");  // Make sure this path is correct\n",
        "VkShaderModule shaderModule = createShaderModule(device, shaderCode);\n",
        "\n",
        "pipelineInfo.stage.module = shaderModule; // The shaderModule you created\n",
        "pipelineInfo.stage.pName = \"main\";\n",
        "pipelineInfo.layout = pipelineLayout;\n",
        "\n",
        "if (vkCreateComputePipelines(device, VK_NULL_HANDLE, 1, &pipelineInfo, nullptr, &computePipeline) != VK_SUCCESS) {\n",
        "    throw std::runtime_error(\"Failed to create compute pipeline!\");\n",
        "}\n",
        "\n",
        "\n",
        "}\n",
        "\n",
        "\n",
        "int main() {\n",
        "\n",
        "      try {\n",
        "        createInstance();\n",
        "        pickPhysicalDevice();\n",
        "        createLogicalDevice();\n",
        "        createShaderModuler();\n",
        "        createPipeline();\n",
        "        createCommandPool();\n",
        "        createCommandBuffer();\n",
        "    } catch (const std::exception& e) {\n",
        "        std::cerr << e.what() << std::endl;\n",
        "        return EXIT_FAILURE;\n",
        "    }\n",
        "\n",
        "\n",
        "    string data = \"abcde\";\n",
        "    unordered_map<char, int> vocab;\n",
        "    unordered_map<int, char> inv_vocab;\n",
        "    int vocab_size = 0;\n",
        "\n",
        "    for (char c : data) {\n",
        "        vocab[c] = vocab_size;\n",
        "        inv_vocab[vocab_size] = c;\n",
        "        vocab_size++;\n",
        "    }\n",
        "\n",
        "    vector<vector<double>> w(vocab_size, vector<double>(vocab_size, rand() % 100 / 100.0));\n",
        "    double lr = 0.1;\n",
        "    vector<int> data_idx;\n",
        "\n",
        "    for (char c : data) {\n",
        "        data_idx.push_back(vocab[c]);\n",
        "    }\n",
        "\n",
        "    for (int i = 0; i < data_idx.size() - 1; ++i) {\n",
        "        int x = data_idx[i], y = data_idx[i + 1];\n",
        "        vector<double> x_onehot(vocab_size, 0);\n",
        "        x_onehot[x] = 1;\n",
        "\n",
        "        vector<double> pred(vocab_size);\n",
        "        for (int j = 0; j < vocab_size; ++j) {\n",
        "            pred[j] = exp(x_onehot[j] * w[j][x]);\n",
        "        }\n",
        "\n",
        "        double sum_pred = accumulate(pred.begin(), pred.end(), 0.0);\n",
        "        for (double &p : pred) p /= sum_pred;\n",
        "\n",
        "        double loss = -log(pred[y]);\n",
        "\n",
        "        vector<double> grad = pred;\n",
        "        grad[y] -= 1;\n",
        "\n",
        "        for (int j = 0; j < vocab_size; ++j) {\n",
        "            w[j][x] -= lr * grad[j];\n",
        "        }\n",
        "    }\n",
        "\n",
        "\n",
        " // Start timer\n",
        "    auto start_time = std::chrono::high_resolution_clock::now();\n",
        "\n",
        "    char input_char = 'a';\n",
        "    int input_idx = vocab[input_char];\n",
        "    vector<double> input_onehot(vocab_size, 0);\n",
        "    input_onehot[input_idx] = 1;\n",
        "\n",
        "    runComputationOnGPU();\n",
        "    vector<double> output_prob(vocab_size);\n",
        "    for (int i = 0; i < vocab_size; ++i) {\n",
        "        output_prob[i] = exp(input_onehot[i] * w[i][input_idx]);\n",
        "    }\n",
        "\n",
        "    double sum_output = accumulate(output_prob.begin(), output_prob.end(), 0.0);\n",
        "    for (double &p : output_prob) p /= sum_output;\n",
        "\n",
        "    int output_idx = max_element(output_prob.begin(), output_prob.end()) -\n",
        "output_prob.begin();\n",
        "    char output_char = inv_vocab[output_idx];\n",
        "\n",
        "    // Stop timer and calculate elapsed time in milliseconds\n",
        "    auto end_time = std::chrono::high_resolution_clock::now();\n",
        "    auto duration = std::chrono::duration_cast<std::chrono::nanoseconds>(end_time -\n",
        "start_time).count();\n",
        "\n",
        "    cout << \"Given '\" << input_char << \"', next predicted character is '\" << output_char <<\n",
        "\"'\\n\";\n",
        "    cout << \"Inference time: \" << duration << \" nanosecondss\\n\";\n",
        "    return 0;\n",
        "}\n",
        "\n",
        "```"
      ],
      "metadata": {
        "id": "0Sk50gUFxT_G"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Run the following 2 commands to compile.\n",
        "\n",
        "- g++ vulkan.cpp -lvulkan -o vulkan -std=c++11\n",
        "- !./vulkan"
      ],
      "metadata": {
        "id": "u_f_vctWxcPZ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 5 - Accelerate on iOS & Android GPUs using Compiler #4: Tensor Virtual Machine"
      ],
      "metadata": {
        "id": "zkcnpOid4LVm"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Hardware Optimization Strategies for Machine Learning\n",
        "\n",
        "- There are many ways to make Inference faster.\n",
        "- Each GPU process data differently.\n",
        "- Below are some examples\n",
        "\n",
        "- ![Alt Text](https://ars.els-cdn.com/content/image/1-s2.0-S1383762120301430-gr26.jpg)\n",
        "\n",
        "\n",
        "## General GPU Optimizations\n",
        "\n",
        "- **Memory Coalescing**: Structure your memory access patterns to make them contiguous and aligned, reducing memory latency.\n",
        "  \n",
        "- **Loop Unrolling**: Reduce the overhead of the loop control code to improve performance.\n",
        "\n",
        "- **Data Tiling**: Break the data into smaller chunks or \"tiles\" to fit into faster but smaller memory caches.\n",
        "\n",
        "- **Vectorization**: Use SIMD (Single Instruction, Multiple Data) operations to process multiple data points in a single instruction.\n",
        "\n",
        "- **Batching**: Process multiple inputs at the same time to make better use of the hardware.\n",
        "\n",
        "- **Prefetching**: Manually load data into cache before it's needed to reduce latency.\n",
        "\n",
        "- **Pipeline Parallelism**: Overlap the execution of independent operations.\n",
        "\n",
        "---\n",
        "\n",
        "## CUDA-Specific Optimizations\n",
        "\n",
        "- **Shared Memory**: Use shared memory to avoid redundant global memory access.\n",
        "\n",
        "- **Warp Shuffle**: Use warp shuffle functions for faster intra-warp communication.\n",
        "\n",
        "- **Stream Multiprocessor Occupancy**: Maximize the use of stream multiprocessors for higher throughput.\n",
        "\n",
        "- **Zero-Copy**: For host-device communication, use zero-copy memory to reduce data transfer times.\n",
        "\n",
        "---\n",
        "## Metal-Specific Optimizations\n",
        "\n",
        "- **Threadgroup Memory**: Use threadgroup shared memory for faster data sharing among threads.\n",
        "\n",
        "- **Metal Performance Shaders**: Use built-in high-performance shaders for common operations where available.\n",
        "\n",
        "- **Resource Groups**: Use argument buffers to efficiently manage resources.\n",
        "\n",
        "- **Half-Precision (fp16)**: Use half-precision floating-point arithmetic where full precision isn't required for performance gain.\n",
        "\n",
        "---\n",
        "\n",
        "## (AMD) ROCm-Specific Optimizations\n",
        "\n",
        "- **Local Data Share (LDS)**: Similar to CUDA's shared memory, use LDS to share data between threads in a wavefront.\n",
        "\n",
        "- **Wavefront Optimizations**: Utilize the full wavefront for computations to improve performance.\n",
        "\n",
        "---\n",
        "\n",
        "## OpenCL-Specific Optimizations\n",
        "\n",
        "- **Local Memory**: Like CUDA's shared memory and ROCm's LDS, use local memory to store frequently accessed data.\n",
        "\n",
        "- **Event-driven Execution**: Use OpenCL events to manage command execution dependencies effectively.\n",
        "\n",
        "- **Image Objects**: Use OpenCL image objects for better caching behavior if applicable.\n",
        "\n",
        "---\n",
        "\n",
        "..... this seems hard to learn\n",
        "\n",
        "\n",
        "### Let's instead learn how to use the Tensor Virtual Machine, a cross platform compiler and runtime for Machine Learning models\n",
        "\n",
        "- ![Alt Text](https://www.opensourceforu.com/wp-content/uploads/2019/06/Figure-1-The-TVM-stack.jpg)\n",
        "\n",
        "- ![Alt Text](https://developer.qualcomm.com/sites/default/files/attachments/qdn-blog-post-opencl-tvm-image01-800.png)\n",
        "\n",
        "- ![Alt Text](https://i.imgur.com/BCg6gCz.png)\n",
        "\n",
        "- ![Alt Text](https://russianblogs.com/images/240/46786653c6e5d7f59d4dee3d26aa6718.png)\n",
        "\n",
        "- ![Alt Text](https://www.researchgate.net/publication/348753417/figure/fig6/AS:984105013497856@1611640327149/An-example-of-blocks-for-scheduling-optimization-with-their-corresponding-scheduling.png)\n",
        "\n",
        "\n",
        "## 6 Step High Level Process\n",
        "\n",
        "\n",
        "1. Model Import: First, the PyTorch model is imported into TVM.\n",
        "\n",
        "2. Optimization: TVM optimizes the computational graph of the model, performing various transformations and applying optimizations like quantization.\n",
        "\n",
        "3. Partitioning: The optimized model is then divided into smaller chunks, or \"shards\". This is typically done in such a way as to minimize dependencies between the shards so that they can be loaded and run independently.\n",
        "\n",
        "4. Code Generation: TVM generates the code for each shard, targeting the specific architecture (iOS or Android).\n",
        "\n",
        "5. Packaging: The shards are bundled into separate binary files, often with metadata to indicate their sequence and dependencies.\n",
        "\n",
        "6. Deployment: These binary shards can then be deployed onto the mobile device. The application logic is responsible for loading the shards as needed and performing the inference.\n",
        "\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "n7HfbfpTOcnd"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip uninstall -y numpy\n",
        "!pip uninstall -y setuptools\n",
        "!pip install setuptools\n",
        "!pip install numpy"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 374
        },
        "id": "z_1q7Qm1l99A",
        "outputId": "3a128d6f-906f-4ae3-f357-270f848c971a"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Found existing installation: numpy 1.23.5\n",
            "Uninstalling numpy-1.23.5:\n",
            "  Successfully uninstalled numpy-1.23.5\n",
            "Found existing installation: setuptools 67.7.2\n",
            "Uninstalling setuptools-67.7.2:\n",
            "  Successfully uninstalled setuptools-67.7.2\n",
            "Collecting setuptools\n",
            "  Using cached setuptools-68.2.2-py3-none-any.whl (807 kB)\n",
            "Installing collected packages: setuptools\n",
            "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
            "ipython 7.34.0 requires jedi>=0.16, which is not installed.\n",
            "numba 0.56.4 requires numpy<1.24,>=1.18, but you have numpy 1.26.0 which is incompatible.\n",
            "tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 1.26.0 which is incompatible.\u001b[0m\u001b[31m\n",
            "\u001b[0mSuccessfully installed setuptools-68.2.2\n"
          ]
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.colab-display-data+json": {
              "pip_warning": {
                "packages": [
                  "_distutils_hack",
                  "pkg_resources",
                  "setuptools"
                ]
              }
            }
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (1.26.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install apache-tvm\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FneO_t8YmKn6",
        "outputId": "a6006539-9d30-4072-ccc5-87205e7f453f"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Collecting apache-tvm\n",
            "  Downloading apache_tvm-0.11.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (47.2 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m47.2/47.2 MB\u001b[0m \u001b[31m30.1 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: attrs in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (23.1.0)\n",
            "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (2.2.1)\n",
            "Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (4.4.2)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (1.23.5)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (5.9.5)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (1.11.2)\n",
            "Collecting synr==0.6.0 (from apache-tvm)\n",
            "  Downloading synr-0.6.0-py3-none-any.whl (18 kB)\n",
            "Requirement already satisfied: tornado in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (6.3.2)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from apache-tvm) (4.5.0)\n",
            "Installing collected packages: synr, apache-tvm\n",
            "Successfully installed apache-tvm-0.11.1 synr-0.6.0\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "!python3 -m pip install mlc-ai-nightly -f https://mlc.ai/wheels\n",
        "!python3 -m pip install torch torchvision torchaudio torchsummary --extra-index-url https://download.pytorch.org/whl/cpu"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Yx4V-0s6n-9q",
        "outputId": "40b86e0a-c043-410a-cb2d-ec0e4ac736a3"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in links: https://mlc.ai/wheels\n",
            "Collecting mlc-ai-nightly\n",
            "  Downloading https://github.com/mlc-ai/package/releases/download/v0.9.dev0/mlc_ai_nightly-0.12.dev1576-cp310-cp310-manylinux_2_28_x86_64.whl (86.3 MB)\n",
            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m86.3/86.3 MB\u001b[0m \u001b[31m15.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
            "\u001b[?25hRequirement already satisfied: attrs in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (23.1.0)\n",
            "Requirement already satisfied: cloudpickle in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (2.2.1)\n",
            "Requirement already satisfied: decorator in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (4.4.2)\n",
            "Requirement already satisfied: ml-dtypes in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (0.2.0)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (1.23.5)\n",
            "Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (5.9.5)\n",
            "Requirement already satisfied: scipy in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (1.11.2)\n",
            "Requirement already satisfied: tornado in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (6.3.2)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from mlc-ai-nightly) (4.5.0)\n",
            "Installing collected packages: mlc-ai-nightly\n",
            "Successfully installed mlc-ai-nightly-0.12.dev1576\n",
            "Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu\n",
            "Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.0.1+cu118)\n",
            "Requirement already satisfied: torchvision in /usr/local/lib/python3.10/dist-packages (0.15.2+cu118)\n",
            "Requirement already satisfied: torchaudio in /usr/local/lib/python3.10/dist-packages (2.0.2+cu118)\n",
            "Requirement already satisfied: torchsummary in /usr/local/lib/python3.10/dist-packages (1.5.1)\n",
            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.12.2)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.10/dist-packages (from torch) (4.5.0)\n",
            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch) (1.12)\n",
            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.1)\n",
            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.2)\n",
            "Requirement already satisfied: triton==2.0.0 in /usr/local/lib/python3.10/dist-packages (from torch) (2.0.0)\n",
            "Requirement already satisfied: cmake in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch) (3.27.4.1)\n",
            "Requirement already satisfied: lit in /usr/local/lib/python3.10/dist-packages (from triton==2.0.0->torch) (16.0.6)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.10/dist-packages (from torchvision) (1.23.5)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from torchvision) (2.31.0)\n",
            "Requirement already satisfied: pillow!=8.3.*,>=5.3.0 in /usr/local/lib/python3.10/dist-packages (from torchvision) (9.4.0)\n",
            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (2.1.3)\n",
            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (3.2.0)\n",
            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (3.4)\n",
            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (2.0.4)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->torchvision) (2023.7.22)\n",
            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch) (1.3.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import IPython\n",
        "import numpy as np\n",
        "import tvm\n",
        "from tvm import relax\n",
        "from tvm.ir.module import IRModule\n",
        "from tvm.script import relax as R\n",
        "from tvm.script import tir as T"
      ],
      "metadata": {
        "id": "IMZaf5hHN0KY"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import torch\n",
        "import torchvision\n",
        "\n",
        "test_data = torchvision.datasets.FashionMNIST(\n",
        "    root=\"data\",\n",
        "    train=False,\n",
        "    download=True,\n",
        "    transform=torchvision.transforms.ToTensor()\n",
        ")\n",
        "test_loader = torch.utils.data.DataLoader(test_data, batch_size=1, shuffle=True)\n",
        "class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat',\n",
        "               'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle boot']\n",
        "\n",
        "img, label = next(iter(test_loader))\n",
        "img = img.reshape(1, 28, 28).numpy()"
      ],
      "metadata": {
        "id": "ddHa_AgVN05I",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "b113afde-8cb0-44c9-efd7-f37156afdf29"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-images-idx3-ubyte.gz to data/FashionMNIST/raw/train-images-idx3-ubyte.gz\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 26421880/26421880 [00:00<00:00, 112959609.42it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Extracting data/FashionMNIST/raw/train-images-idx3-ubyte.gz to data/FashionMNIST/raw\n",
            "\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw/train-labels-idx1-ubyte.gz\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 29515/29515 [00:00<00:00, 4669742.84it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Extracting data/FashionMNIST/raw/train-labels-idx1-ubyte.gz to data/FashionMNIST/raw\n",
            "\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 4422102/4422102 [00:00<00:00, 64735997.47it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Extracting data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz to data/FashionMNIST/raw\n",
            "\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz\n",
            "Downloading http://fashion-mnist.s3-website.eu-central-1.amazonaws.com/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "100%|██████████| 5148/5148 [00:00<00:00, 5152058.46it/s]\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Extracting data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz to data/FashionMNIST/raw\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import matplotlib.pyplot as plt\n",
        "\n",
        "plt.figure()\n",
        "plt.imshow(img[0])\n",
        "plt.colorbar()\n",
        "plt.grid(False)\n",
        "plt.show()\n",
        "print(\"Class:\", class_names[label[0]])"
      ],
      "metadata": {
        "id": "ujJDpXLiN4ei",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 452
        },
        "outputId": "58cb4a87-39c9-4c4f-867a-342b09e9a183"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/plain": [
              "<Figure size 640x480 with 2 Axes>"
            ],
            "image/png": "iVBORw0KGgoAAAANSUhEUgAAAesAAAGiCAYAAADHpO4FAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAA18UlEQVR4nO3dfXRU5b33/8/MJJkk5IkQ8gThWUXLkwWJ8alYc4jQm5bqWT9Eb0GWxVubeAtZPVpaIT6dptUjZbWlsmqLtOtXKtpf1VY9uGxq8OctaMXmUHokCkKJQMKTJBDIJJnZ9x+UqSMB5tozYfZm3i/WXovs7O9cV3Z28s117Wv212NZliUAAOBY3kR3AAAAnB3JGgAAhyNZAwDgcCRrAAAcjmQNAIDDkawBAHA4kjUAAA5HsgYAwOFI1gAAOBzJGgAAhyNZAwBg4M0339SsWbNUWloqj8ejF1988ZwxjY2N+uIXvyi/368xY8ZozZo1Rm2SrAEAMNDZ2amJEydq5cqVUR2/c+dOfeUrX9H111+vpqYmLVq0SN/4xjf02muvRd2mh0IeAADY4/F49MILL2j27NlnPOaBBx7QK6+8oq1bt4b33XLLLTpy5IjWr18fVTspsXY03kKhkPbu3avs7Gx5PJ5EdwcAYMiyLB09elSlpaXyevtvArerq0vd3d0xv45lWaflG7/fL7/fH/NrS9LGjRtVWVkZsa+qqkqLFi2K+jUcl6z37t2rsrKyRHcDABCjlpYWDR06tF9eu6urSyOHZ6l1fzDm18rKytKxY8ci9tXV1emhhx6K+bUlqbW1VUVFRRH7ioqK1NHRoRMnTigjI+Ocr+G4ZJ2dnS1JukYzlaLUBPcGAGCqVz16S6+Gf5/3h+7ubrXuD2rn5uHKybY/eu84GtLIyX9XS0uLcnJywvvjNaqOF8cl61NTESlKVYqHZA0ArvOPlVDn41ZmTrY3pmQdfp2cnIhkHU/FxcVqa2uL2NfW1qacnJyoRtVSP64GX7lypUaMGKH09HSVl5fr3Xff7a+mAABJKmiFYt76W0VFhRoaGiL2vf7666qoqIj6NfolWa9bt061tbWqq6vT+++/r4kTJ6qqqkr79+/vj+YAAEkqJCvmzdSxY8fU1NSkpqYmSSffmtXU1KTdu3dLkpYsWaJ58+aFj7/77rv18ccf6/7779e2bdv005/+VM8995wWL14cdZv9kqyXL1+uhQsXasGCBbrsssu0atUqZWZmavXq1acdGwgE1NHREbEBABCNUBz+mXrvvfd0+eWX6/LLL5ck1dbW6vLLL9eyZcskSfv27QsnbkkaOXKkXnnlFb3++uuaOHGinnzySf385z9XVVVV1G3G/Z51d3e3Nm/erCVLloT3eb1eVVZWauPGjacdX19fr4cffjje3QAAoF9MmzZNZ3tESV9PJ5s2bZr+8pe/2G4z7iPrgwcPKhgM9rlMvbW19bTjlyxZovb29vDW0tIS7y4BAC5QQcuKeXODhK8Gj+cbzwEAycXufefPxrtB3EfWBQUF8vl8fS5TLy4ujndzAABc8OKerNPS0jR58uSIZeqhUEgNDQ1Gy9QBADiXkCwFY9jcMrLul2nw2tpazZ8/X1OmTNHUqVO1YsUKdXZ2asGCBf3RHAAgSSXLNHi/JOs5c+bowIEDWrZsmVpbWzVp0iStX7/+tEVnAADg3PptgVlNTY1qamr66+UBAIh5RTerwQEA6Gehf2yxxLtB/xUaBQAAccHIGgDgWqdWdccS7wYkawCAawWtk1ss8W5AsgYAuBb3rAEAgCMwsgYAuFZIHgXliSneDUjWAADXClknt1ji3YBpcAAAHI6RNQDAtYIxToPHEns+kawBAK5Fskby8PrsxYWC8e1HHKWU2KudHrik1DjG19VrHOM91m0cE8pKM46xy7N5m3GM1WP+Ndnh8fuNY6xAoB96Apw/JGsAgGuFLI9CVgyrwWOIPZ9I1g
AA10qWaXBWgwMA4HCMrAEArhWUV8EYxp3OXXkTiWQNAHAtK8Z71hb3rAEA6F/cswYAAI7AyBoA4FpBy6ugFcM9a5c8G5xkDQBwrZA8CsUwSRySO7I10+AAADgcI2sAgGslywIzkjUAwLViv2fNNDgAAIgD546svT7JY1ANysEVoGzzmE/PeG1UJAp1dRnH2OW7ZIxxzKeTC4xjMlt7jGMkyb9ll3FM8NBh45iQjb/m7VSb0mXm51uS9v7vKeZBNgYoJcvfNm/mPFbQ8qSY/4q0es2rsMG+kwvMYijkwTQ4AAD9KxTj40ZZDQ4AAOKCkTUAwLWSZYEZyRoA4FoheZPioSgkawCAawUtj4IxVM6KJfZ84p41AAAOx8gaAOBawRhXgweZBgcAoH+FLK9CMSwws/PMg0RgGhwAAIdjZA0AcC2mwQEAcLiQYlvRHYpfV/oV0+AAADicc0fWoaDkcejfEjYKbHjS0oxj7BQssFOUI2XoEOMYSWqdOcw4JmuvecGVnLWbjGPscnI5GFsFLP7yN1ttlfzFPCYw8wrjmE/+vy8Yx6S/lmMcU/CzjcYxEkU53CD2h6I4NM98jnOTNQAA5xD740bdkazd0UsAAJIYI2sAgGtRzxoAAIdLlmlwkjUAwLVif5+1O5K1O3oJAEASY2QNAHCtkOVRKJaHorikRCbJGgDgWqEYp8Hd8j5rd/QSAIAkxsgaAOBasZfIdMeYlWQNAHCtoDwKxvBe6Vhizyd3/EkBAEASc+7I2uMxKpjhSUk1bsIK2izbEDKPs1OEwZeXaxyzc5F5YQSvjfoQkjRi3V7jmN6Pd5k3ZKNwijcz07wd2SyWYafgjGWjMJ/PZx5j8xq387Phf/XPxjHD/pRuHLNtxQTjmJ4XLzWOkaQhdeYxof/6wFZbsIdpcAAAHC6o2KaynVxp77Pc8ScFAABJjJE1AMC1kmUaPO69fOihh+TxeCK2sWPHxrsZAADChTxi2dygX3r5hS98Qfv27Qtvb731Vn80AwBIctY/SmTa3Syb97tXrlypESNGKD09XeXl5Xr33XfPevyKFSt0ySWXKCMjQ2VlZVq8eLG6urqibq9fpsFTUlJUXFwc1bGBQECBz6zA7ejo6I8uAQAQF+vWrVNtba1WrVql8vJyrVixQlVVVWpublZhYeFpx69du1bf/va3tXr1al111VX68MMPdccdd8jj8Wj58uVRtdkvI+uPPvpIpaWlGjVqlG677Tbt3r37jMfW19crNzc3vJWVlfVHlwAAF6BETIMvX75cCxcu1IIFC3TZZZdp1apVyszM1OrVq/s8/u2339bVV1+tW2+9VSNGjND06dM1d+7cc47GPyvuybq8vFxr1qzR+vXr9dRTT2nnzp269tprdfTo0T6PX7Jkidrb28NbS0tLvLsEALhAnaq6FcsmnZzV/ewWOMMzF7q7u7V582ZVVlaG93m9XlVWVmrjxo19xlx11VXavHlzODl//PHHevXVVzVz5syov864T4PPmDEj/P8JEyaovLxcw4cP13PPPac777zztOP9fr/8fn+8uwEAQNQ+P6tbV1enhx566LTjDh48qGAwqKKiooj9RUVF2rZtW5+vfeutt+rgwYO65pprZFmWent7dffdd+s73/lO1P3r97du5eXl6eKLL9b27dv7uykAQJIJxlgi81RsS0uLcnJywvvjOYhsbGzU9773Pf30pz9VeXm5tm/frvvuu0+PPvqoli5dGtVr9HuyPnbsmHbs2KHbb7+9v5sCACSZz05l242XpJycnIhkfSYFBQXy+Xxqa2uL2N/W1nbGhdVLly7V7bffrm984xuSpPHjx6uzs1N33XWXvvvd78rrPfcfG3G/Z/2tb31LGzZs0K5du/T222/r61//unw+n+bOnRvvpgAAOK/S0tI0efJkNTQ0hPeFQiE1NDSooqKiz5jjx4+flpB9/3jWv2VZUbUb95H1J598orlz5+rQoUMaPHiwrrnmGm3atEmDBw
+Od1MRrJ7ufn39WO1+6CrjmIwvHjKOGVV95pX3Z9Lb8olxjCT12oqyIcqL+bNCnZ390JEE6z1vZ/y8CRm8z/SUi++OfgXtKSkjhhnHSNIrb//eOGba1tnGMf7pu4xjcFJIXoViGHfaia2trdX8+fM1ZcoUTZ06VStWrFBnZ6cWLFggSZo3b56GDBmi+vp6SdKsWbO0fPlyXX755eFp8KVLl2rWrFnhpH0ucU/Wzz77bLxfEgCAPgUtj4IxTIPbiZ0zZ44OHDigZcuWqbW1VZMmTdL69evDi852794dMZJ+8MEH5fF49OCDD2rPnj0aPHiwZs2apX//93+Puk2eDQ4AgKGamhrV1NT0+bnGxsaIj1NSUlRXV6e6Ohs1V0+9hu1IAAASLF4LzJyOZA0AcC0rxqpblksKeZCsAQCuFZRHQZvFOE7Fu4E7/qQAACCJMbIGALhWyIrtvnPI/J2hCUGyBgC4VijGe9axxJ5P7uglAABJjJE1AMC1QvIoFMMisVhizyeSNQDAtRLxBLNEYBocAACHc+7I2rIkOXOZ3odPX2Ec4z0aMo4Z9tVm4xg7pR68ky6zESVZHvO/SEPp5pdcT06qcYzlO39/LXt7za9T33Hz75SVYv63tafX/LqTpGB6dMUFPsvXY96WnfPgPW5etKf3b+Y/S5I0Y+atxjF77k8zjrGe6Lta09kM/ZO9wi7+//yzrTinSpYFZs5N1gAAnENIMT5u1CX3rN3xJwUAAEmMkTUAwLWsGFeDWy4ZWZOsAQCuRdUtAAAcLlkWmLmjlwAAJDFG1gAA12IaHAAAh0uWx40yDQ4AgMMxsgYAuBbT4AAAOFyyJGumwQEAcDhG1gAA10qWkbVzk7XXJ3kMKv+EgsZNBK//onGMJL30Lz82jvnXjf/LOGb38+ONYyaW7jGO2XvshHGMJFUU7jSO2XMizzjmnV0jjGOys+x9TR1HM41jMjIDxjGlOR3GMQNSzds53mteAUqS9hweaBxz4tMM45gxow4ax+SnHzOOuSTLvFKXJD37oXn1sWCnecxFX2wxjtkzOtc4RpIyC80rfA385UZbbZ0PyZKsmQYHAMDhnDuyBgDgHCzF9l5p82r0iUGyBgC4VrJMg5OsAQCulSzJmnvWAAA4HCNrAIBrJcvImmQNAHCtZEnWTIMDAOBwjKwBAK5lWR5ZMYyOY4k9n0jWAADXop41AABwBEbWAADXSpYFZo5N1t60FHk9qVEfH+oyL+Sx86v2ihy80G5eACQYNJ/EuHKEeaGM7pD5t7Rlb75xjCS1vl9sHNMzqNc4JvWQ+deUtcne9zY9w/z7FPL5jWP2D8wzjknpNH8w4uEJ9h6mGMrrMY7xdJufu7/vN7/2dgQKjWP+3DbWOEaSrDLzgjBXXbzDOGZnh/l5GHLT34xjJKnrf0w1jjl+U7nR8b09XdLvXzJux45kuWfNNDgAAA7n2JE1AADnwjQ4AAAOlyzT4CRrAIBrWTGOrN2SrLlnDQCAwzGyBgC4liXJsvemh3C8G5CsAQCuFZJHHp5gBgAAEo2RNQDAtVgNDgCAw4UsjzxJ8D5rpsEBAHA4RtYAANeyrBhXg7tkObhjk3WoK6CQJ9SvbfzPL///tuJ2nzB/6H7OpgzjmPf/Os44JjDI/MpLsXmx+g+bTx/15JrH+Me2G8e0FGYax0hSygGfcUzWJ+ZfU3e2cYg6Rpt/o3K225s8OzbMvBBK3hcOGcccD0RfrOcUO0VxUkYdM46RpEuLWo1jPjhkXmgkFDL/mvKGlBrHSFL6y+8ax6SUmBXt6Q11G7dhV7Lcs2YaHAAAh3PsyBoAgHNJlpE1yRoA4FqsBj+DN998U7NmzVJpaak8Ho9efPHFiM9blqVly5appKREGRkZqqys1EcffRSv/gIAEHZqgVksmxsYJ+vOzk5NnDhRK1eu7PPzjz/+uH70ox9p1apVeueddzRgwABVVVWpq6sr5s4CAJCMjKfBZ8yYoRkzZvT5OcuytG
LFCj344IP62te+Jkn61a9+paKiIr344ou65ZZbTosJBAIKBALhjzs6Oky7BABIUidHx7Hcs45jZ/pRXFeD79y5U62traqsrAzvy83NVXl5uTZu3NhnTH19vXJzc8NbWVlZPLsEALiAnVpgFsvmBnFN1q2tJ9+TWFRUFLG/qKgo/LnPW7Jkidrb28NbS0tLPLsEAIDrJXw1uN/vl9/vT3Q3AAAuZCm2mtQumQWP78i6uPjkU27a2toi9re1tYU/BwBAvDANbsPIkSNVXFyshoaG8L6Ojg698847qqioiGdTAAAkDeNp8GPHjmn79u3hj3fu3Kmmpibl5+dr2LBhWrRokR577DFddNFFGjlypJYuXarS0lLNnj07nv0GACBp5sGNk/V7772n66+/PvxxbW2tJGn+/Plas2aN7r//fnV2duquu+7SkSNHdM0112j9+vVKT0+PX6/7cuUE45C3D35qq6lR2QeNY0bP+dA45i9/HmMck7nXxmSJzfkVy7zmhTL2mS+TCO7JM45JN68PcbKtdPOf3ONF5jFZnxiHqNtGEZSOMfaK4WS0mV8UR7YOMo7xBo1DlGLjS+rOsvcbufn9i4xjcv5u3kHrf5r/TmmbMdw4RpIG/XyvcUzvPrOCJr1Wj3EbtsU6lW0zduXKlXriiSfU2tqqiRMn6sc//rGmTp16xuOPHDmi7373u/rd736nw4cPa/jw4VqxYoVmzpwZVXvGvzmnTZsm6yxvTPN4PHrkkUf0yCOPmL40AABGElEic926daqtrdWqVatUXl6uFStWqKqqSs3NzSosPL3qWnd3t/7lX/5FhYWF+u1vf6shQ4bo73//u/Ly8qJuM+GrwQEAcJPly5dr4cKFWrBggSRp1apVeuWVV7R69Wp9+9vfPu341atX6/Dhw3r77beVmnpy2m/EiBFGbVIiEwDgWvFaDd7R0RGxffbJmp/V3d2tzZs3Rzz8y+v1qrKy8owP//r973+viooKVVdXq6ioSOPGjdP3vvc9BYPR3wciWQMA3MvyxL5JKisri3iaZn19fZ/NHTx4UMFg0OjhXx9//LF++9vfKhgM6tVXX9XSpUv15JNP6rHHHov6y2QaHACQ9FpaWpSTkxP+OJ4P6wqFQiosLNTPfvYz+Xw+TZ48WXv27NETTzyhurq6qF6DZA0AcK14LTDLycmJSNZnUlBQIJ/PZ/Twr5KSEqWmpsrn++dbaC699FK1traqu7tbaWlp52yXaXAAgHtZcdgMpKWlafLkyREP/wqFQmpoaDjjw7+uvvpqbd++XaHQP9/W9+GHH6qkpCSqRC2RrAEAMFJbW6unn35av/zlL/XBBx/onnvuUWdnZ3h1+Lx587RkyZLw8ffcc48OHz6s++67Tx9++KFeeeUVfe9731N1dXXUbTINDgBwrVif720nds6cOTpw4ICWLVum1tZWTZo0SevXrw8vOtu9e7e83n+OhcvKyvTaa69p8eLFmjBhgoYMGaL77rtPDzzwQNRtkqwBAO6WgEeG1tTUqKamps/PNTY2nravoqJCmzZtst0e0+AAADgcI2sAgGslYho8EUjWAAD3ouqWu+yZlmUcc3yPvS9/8OhjxjHF6UeNY+Z++f8Yxzy37XLjmJS/mZ87SUrtMI9JO2L+k+GxUWUplGrvr+WUE+YxQRvPTrDzx3zOx+YxvZn27nR1n/vtpqdJP2T+RXltFGfyBcyvoWC6vfPQm2Ees39m34+pPJtir/lF3lFs7xo3r43mdJ5/bLHEOx/3rAEAcLgLZmQNAEhCTIMDAOBwSZKsmQYHAMDhGFkDANzrM2Uubce7AMkaAOBa8aq65XRMgwMA4HCMrAEA7pUkC8xI1gAA90qSe9ZMgwMA4HCMrAEAruWxTm6xxLsByRoA4F7cs3aX7jzzM+49kGarrcNDM41j/nag2DhmfOE+45hxpeYx/9Uz1DhGkgLn6VaPddj8++TrsneHx87tK0/QRkN22rHxS8Wy+ZvIToGNLv
MfC1kp5v2zbPzWsmwOn9JKOo1jfCHzb+6BI+bFdHoH26hwcyHinjUAAHCCC2ZkDQBIQkyDAwDgcEmSrJkGBwDA4RhZAwDcK0lG1iRrAIB7sRocAAA4ASNrAIBr8QQzAACcLknuWTMNDgCAw5GsAQBwOKbBAQCu5VGM96zj1pP+5dhk3fulSVJKetTHp3San/LuXHvf4eFZh41jmj8cYh6TYl4hoiznU+OYzKyAcYwkHTswwDgma7B5YQRPVpdxTNcJe0VarJD5ZJOdqyjUa369WjYKRHh89q5xy0ac12d+vfq85u2kpJgXsPDY/G0eDJpfDx4bv/2DvebtXHXFNvOGJB2wFeVgvHULAAA4gWNH1gAAnFOSrAYnWQMA3CtJkjXT4AAAOBwjawCAa/EEMwAAnI5pcAAA4ASMrAEA7pUkI2uSNQDAtZLlnjXT4AAAOBwjawCAeyXJ40ZJ1gAA9+KedWLt/VKavOnRF2MY9p/HjdvYPi/VOEaSBqcdM47xdJv/9XbwQLZxTMjGhZee2mseJKkrq8c4pvNo9MVZTklJM+9fqo0YSfLbOBcDM08Yxxw5YX4eeoM+45hMf7dxjCR5bdzIszM+CfSa/wrq7jU/D/kDzH8/SNIn75caxwzaYn7uOoeY35E8XJxpHHPSEZtxzsQ9awAA4AiOHVkDAHBOTIMDAOBwMU6DuyVZG0+Dv/nmm5o1a5ZKS0vl8Xj04osvRnz+jjvukMfjidhuvPHGePUXAICkY5ysOzs7NXHiRK1cufKMx9x4443at29fePvNb34TUycBAOiTFYfNBYynwWfMmKEZM2ac9Ri/36/i4uKoXi8QCCgQCIQ/7ujoMO0SACBZJck9635ZDd7Y2KjCwkJdcskluueee3To0KEzHltfX6/c3NzwVlZW1h9dAgDAteKerG+88Ub96le/UkNDg37wgx9ow4YNmjFjhoLBYJ/HL1myRO3t7eGtpaUl3l0CAFygTr3POpbNDeK+GvyWW24J/3/8+PGaMGGCRo8ercbGRt1www2nHe/3++X3++PdDQAALhj9/lCUUaNGqaCgQNu3b+/vpgAAuCD1+/usP/nkEx06dEglJSX93RQAINkkyQIz42R97NixiFHyzp071dTUpPz8fOXn5+vhhx/WzTffrOLiYu3YsUP333+/xowZo6qqqrh2HACAZHk2uHGyfu+993T99deHP66trZUkzZ8/X0899ZS2bNmiX/7ylzpy5IhKS0s1ffp0Pfroo8b3pf2XtsuX2RX18b5HPjZ6fUny3n2ZcYwk5ad0GsdYA/peYBdv7UfNH+5vp2iDJKX5zQt5+DJDxjG9Ngo3eGxWvbNslMs7cGyAcUwwaH4Hys7XFOixN3mWkx4490Gf4zlPv/V8XvNrKDvN/OuRpN4888Iuvh7z6zXFRp2RD5qHmgdJulh7bcU5mksSbiyMf5KnTZsmyzrzmXnttddi6hAAAIjEs8EBAO7FPWsAAJwtWe5ZU88aAACHY2QNAHAvpsEBAHA2psEBAIAjkKwBAO6VoHrWK1eu1IgRI5Senq7y8nK9++67UcU9++yz8ng8mj17tlF7JGsAgHslIFmvW7dOtbW1qqur0/vvv6+JEyeqqqpK+/fvP2vcrl279K1vfUvXXnutcZskawBA0uvo6IjYAoEzP/Vu+fLlWrhwoRYsWKDLLrtMq1atUmZmplavXn3GmGAwqNtuu00PP/ywRo0aZdw/kjUAwLXiVc+6rKxMubm54a2+vr7P9rq7u7V582ZVVlaG93m9XlVWVmrjxo1n7OcjjzyiwsJC3Xnnnba+TlaDAwDcK05v3WppaVFOTk5495nqWRw8eFDBYFBFRUUR+4uKirRt27Y+Y9566y394he/UFNTk+1ukqwBAO4Vp2Sdk5MTkazj5ejRo7r99tv19NNPq6CgwPbrODZZXzdkh9KyUqM+/oOAeXWczAH2KvHsCeQZx/gyzPuXnm5e1aq727zij89n70rPG3DCOK
b9eIZxTIa/2zgmPdX8fEtSx4l045gTx80qyklSTrZ5maUsG+ch1Wev2tvxnuh/9k7pDZpfe/4U8+9Trj/6anynpHjsnYchww8Zx+z5svkvZI+Nn/WcPBuluhCzgoIC+Xw+tbW1Rexva2tTcXHxacfv2LFDu3bt0qxZs8L7QqGTleNSUlLU3Nys0aNHn7Nd7lkDAFwrXveso5WWlqbJkyeroaEhvC8UCqmhoUEVFRWnHT927Fj99a9/VVNTU3j76le/quuvv15NTU0qKyuLql3HjqwBADinBDxutLa2VvPnz9eUKVM0depUrVixQp2dnVqwYIEkad68eRoyZIjq6+uVnp6ucePGRcTn5eVJ0mn7z4ZkDQCAgTlz5ujAgQNatmyZWltbNWnSJK1fvz686Gz37t3yeuM7cU2yBgC4VqKeDV5TU6Oampo+P9fY2HjW2DVr1hi3R7IGALhXklTdYoEZAAAOx8gaAOBeSTKyJlkDAFzL848tlng3YBocAACHY2QNAHAvpsEBAHC2RL1163wjWQMA3IuRdWLlpx5TeqpJMQHzAgzFOUeNYyTpg47TH9buFMFeO4U87BW96Ooxv3wy0swLFozMMy+m8N/77X2PQiHz5SZjSvYbx6R4Q8YxHQHzazwYsrcsJSvNvGhIyDI/d14bwxqfjXOXbvMaz0g1v17tSP0kzTimuKzt3Af1wTdwoHFM8NNPbbWF+HFssgYAICouGR3HgmQNAHCtZLlnzVu3AABwOEbWAAD3YoEZAADOxjQ4AABwBEbWAAD3YhocAABnYxocAAA4AiNrAIB7MQ0OAIDDkawBAHC2ZLln7dhkfaQnU/4ek0Ie5g/3z0ix95D+7qB5sYzU1KB5OwHzb49loxBFSop53yQpJz1gHDMm56BxzH9/WmQc09trbznGkPx245iekPn1YC/G/Gvy++x9b1O99uJM2Snkke4z/7kNyfznQpIGpJgXNCkYesQ45tNj+cYxg9I7jWMk6UiBeSEPUcgj4RybrAEAOCemwQEAcDaPZclj2c+4scSeT7x1CwAAh2NkDQBwL6bBAQBwtmRZDc40OAAADsfIGgDgXkyDAwDgbEyDAwAAR2BkDQBwL6bBAQBwtmSZBidZAwDci5F1YrWcGKhUb5pBhHmBiF4bhREke8UHAsdNipKclOLvNY4ZkN1lHJObYR4jSUMGmBe9+LQ7wzim7XCOcczggUeNYyTJ7zM/54Gg+Y/R4c5M4xiv17xYTaqNGLvs/FzYEbLMi3J09Zr//EmS12N+/spyzIteHCrMMo7Z0lZqHCNJw9LMr3EknmOTNQAA0XDLVHYsSNYAAPeyrJNbLPEuwFu3AABwOKNkXV9fryuuuELZ2dkqLCzU7Nmz1dzcHHFMV1eXqqurNWjQIGVlZenmm29WW1tbXDsNAID0z9XgsWxuYJSsN2zYoOrqam3atEmvv/66enp6NH36dHV2doaPWbx4sf7whz/o+eef14YNG7R3717ddNNNce84AADh1eCxbC5gdM96/fr1ER+vWbNGhYWF2rx5s6677jq1t7frF7/4hdauXasvf/nLkqRnnnlGl156qTZt2qQrr7zytNcMBAIKBALhjzs6Oux8HQAAXLBiumfd3n7yrTv5+fmSpM2bN6unp0eVlZXhY8aOHathw4Zp48aNfb5GfX29cnNzw1tZWVksXQIAJBFPKPbNDWwn61AopEWLFunqq6/WuHHjJEmtra1KS0tTXl5exLFFRUVqbW3t83WWLFmi9vb28NbS0mK3SwCAZMM0+NlVV1dr69ateuutt2LqgN/vl9/vj+k1AAC4kNkaWdfU1Ojll1/WG2+8oaFDh4b3FxcXq7u7W0eOHIk4vq2tTcXFxTF1FACAz2M1eB8sy1JNTY1eeOEF/elPf9LIkSMjPj958mSlpqaqoaEhvK+5uVm7d+9WRUVFfHoMAMAppx6KEsvmAkbT4NXV1Vq7dq1eeuklZWdnh+9D5+bmKiMjQ7m5ub
rzzjtVW1ur/Px85eTk6N5771VFRUWfK8EBAIgFVbf68NRTT0mSpk2bFrH/mWee0R133CFJ+uEPfyiv16ubb75ZgUBAVVVV+ulPf2rcsf/aMlLejPSoj7/IRiGPjkD0r/9Zdr63Hq95lNdGTKovaBxTkNF57oP64PeaFwR49+/DjWMGZAbOfdDnZKb2GMdIUk/IZxxz5IT5dWSniMzwXPPCKXaLa9jp3wkbxTLs9C8n1bzwTNuJbOMYSSrKMC8I09Fjfj1kDDC/xrPSzWMkSRYPrnQjo2RtRTFdkJ6erpUrV2rlypW2OwUAQFQokQkAgLMlyzQ48yEAADgcI2sAgHslSYlMkjUAwLWYBgcAAI7AyBoA4F6sBgcAwNmYBgcAAI7AyBoA4F4h6+QWS7wLkKwBAO7FPWsAAJzNoxjvWcetJ/2Le9YAADicY0fWg9/zyJcW/d88gZlXGLfx6QbzKkGS9D/+n7eNY5pShxrHdPakGcfY+QPT6wnZiJL++9Mi45iQjWpOFw06YBxzvNf83ElS0Eb/CjKPG8f02qh8dKzbbxyTlWazMpMNIct8jJLqNa8S5/eZV3uzW33MznWU7jOv+JY34IRxjJ1rVZI8n3bYinMsnmAGAICz8dYtAADQp5UrV2rEiBFKT09XeXm53n333TMe+/TTT+vaa6/VwIEDNXDgQFVWVp71+L6QrAEA7mXFYTO0bt061dbWqq6uTu+//74mTpyoqqoq7d+/v8/jGxsbNXfuXL3xxhvauHGjysrKNH36dO3ZsyfqNknWAADX8lhWzJskdXR0RGyBwJnXeyxfvlwLFy7UggULdNlll2nVqlXKzMzU6tWr+zz+17/+tb75zW9q0qRJGjt2rH7+858rFAqpoaEh6q+TZA0ASHplZWXKzc0Nb/X19X0e193drc2bN6uysjK8z+v1qrKyUhs3boyqrePHj6unp0f5+flR948FZgAA9wr9Y4slXlJLS4tycnLCu/3+vt99cfDgQQWDQRUVRb4bpqioSNu2bYuqyQceeEClpaURCf9cSNYAANf67FS23XhJysnJiUjW/eX73/++nn32WTU2Nio9PT3qOJI1AABRKigokM/nU1tbW8T+trY2FRcXnzX2P/7jP/T9739ff/zjHzVhwgSjdrlnDQBwr/O8GjwtLU2TJ0+OWBx2arFYRUXFGeMef/xxPfroo1q/fr2mTJli1qgYWQMA3CwBTzCrra3V/PnzNWXKFE2dOlUrVqxQZ2enFixYIEmaN2+ehgwZEl6k9oMf/EDLli3T2rVrNWLECLW2tkqSsrKylJWVFVWbJGsAgGsl4glmc+bM0YEDB7Rs2TK1trZq0qRJWr9+fXjR2e7du+X1/nPi+qmnnlJ3d7f+9V//NeJ16urq9NBDD0XVJskaAABDNTU1qqmp6fNzjY2NER/v2rUr5vYcm6xz1v1ZKZ7oC220LrqqH3sT6W/tJcYxdopytJ+IfqXgKbkZXcYxdosc2Im7Y3x070P8rIKUo8YxOwODjWMk6WAguimpz/r46CDjmNw08+9TV9D8x7XXZrGHFK/5e2HsFNiw086hwADjGDtFRiR75zxoo/jHoAzzYjBb/2u4cYwk5e7bbivOsSjkAQCAs3lCJ7dY4t2A1eAAADgcI2sAgHsxDQ4AgMPZrJwVEe8CTIMDAOBwjKwBAK4Vr2eDOx3JGgDgXklyz5ppcAAAHI6RNQDAvSzFVs/aHQNrkjUAwL24Zw0AgNNZivGeddx60q+4Zw0AgMNdMCPrnL8HjWN897TZauvgcfNCAhcP3G8ckz/Y/OH+Wb6AcUymt9s4RpK2ekuNY/5z72XGMUc6M4xjTnT6jWMkaVjxYeOY0gHtxjGfHMszjvGnmBfKsFvAoifkM46xU9glxcbNxjSv+XlI85vHSFK6jeIkds75uOy9xjF/zRxqHGObx/Rr8py/EWuSrAa/YJI1ACAJhSTZ+5v0n/EuwDQ4AAAOx8gaAOBarAYHAMDpku
SeNdPgAAA4HCNrAIB7JcnImmQNAHCvJEnWTIMDAOBwjKwBAO6VJO+zJlkDAFyLt24BAOB03LMGAABOcMGMrNtHmRceuDjjmK22/rrXvICFf5B5oZHOXvNiFEd70o1j9hzPNY6RpJ0HBhnHdHemGcekZpoXGrls2D7jGEm6NKfVOMZvo7BEfpp5kRavx/zm2v5AtnGMJLUdN487eMy8wM2JrlTjmGCv+c+6XVa7+fWadsh8DLTv7THGMRev/7NxjG2mo8/zOVoNWZKNIjIR8S5wwSRrAEASYhocAAA4ASNrAICLxTiyPm+Ft2NjNLKur6/XFVdcoezsbBUWFmr27Nlqbm6OOGbatGnyeDwR29133x3XTgMAIOmf0+CxbC5glKw3bNig6upqbdq0Sa+//rp6eno0ffp0dXZ2Rhy3cOFC7du3L7w9/vjjce00AADJxGgafP369REfr1mzRoWFhdq8ebOuu+668P7MzEwVFxdH9ZqBQECBQCD8cUdHh0mXAADJLGQppqlsl6wGj2mBWXt7uyQpPz8/Yv+vf/1rFRQUaNy4cVqyZImOHz/z21Tq6+uVm5sb3srKymLpEgAgmVih2DcXsL3ALBQKadGiRbr66qs1bty48P5bb71Vw4cPV2lpqbZs2aIHHnhAzc3N+t3vftfn6yxZskS1tbXhjzs6OkjYAAB8hu1kXV1dra1bt+qtt96K2H/XXXeF/z9+/HiVlJTohhtu0I4dOzR69OjTXsfv98vvN3/4BwAAvM/6LGpqavTyyy/rjTfe0NChQ896bHl5uSRp+/btdpoCAODMQlbsmwsYjawty9K9996rF154QY2NjRo5cuQ5Y5qamiRJJSUltjoIAMAZJcnI2ihZV1dXa+3atXrppZeUnZ2t1taTz1HOzc1VRkaGduzYobVr12rmzJkaNGiQtmzZosWLF+u6667ThAkT+uULAADgQmeUrJ966ilJJx988lnPPPOM7rjjDqWlpemPf/yjVqxYoc7OTpWVlenmm2/Wgw8+GLcOAwAQZinGkXXcetKvjKfBz6asrEwbNmyIqUN2lTz5tnHM8f+30FZbo/PMKybtCZq3ZWVlGMecKDWvfNQ10F4Vo9Qy8yUPoTzzn4xguvk6yA93jjCOkaSelwabB23aYqOlwLkPiYvDtqLSbMRF92QFIM6SZBqcQh4AADgchTwAAO4VCkmK4cEmoQv8oSgAACQc0+AAAMAJGFkDANwrSUbWJGsAgHtRdQsAADgBI2sAgGtZVkhWDGUuY4k9n0jWAAD3smIsxsE9awAA+pkV4z1rlyRr7lkDAOBwjKwBAO4VCkmeGO47c8/a+YJt++0F2o07D/xNNmJstpVrMw4A4oZpcAAA4ARJPbIGALibFQrJimEanLduAQDQ35gGBwAATsDIGgDgXiFL8lz4I2uSNQDAvSxLUixv3XJHsmYaHAAAh2NkDQBwLStkyYphGtxyyciaZA0AcC8rpNimwd3x1i2mwQEArmWFrJg3O1auXKkRI0YoPT1d5eXlevfdd896/PPPP6+xY8cqPT1d48eP16uvvmrUHskaAAAD69atU21trerq6vT+++9r4sSJqqqq0v79fT+K+u2339bcuXN155136i9/+Ytmz56t2bNna+vWrVG36bEcNmHf3t6uvLw8XaOZSlFqorsDADDUqx69pVd15MgR5eb2TxWBjo4O5ebmxpwrTvW1paVFOTk54f1+v19+f9+VE8rLy3XFFVfoJz/5iSQpFAqprKxM9957r7797W+fdvycOXPU2dmpl19+Obzvyiuv1KRJk7Rq1aroOmo5TEtLy6nH0bCxsbGxuXhraWnpt1xx4sQJq7i4OC79zMrKOm1fXV1dn+0GAgHL5/NZL7zwQsT+efPmWV/96lf7jCkrK7N++MMfRuxbtmyZNWHChKi/XsctMCstLVVLS4uys7Pl8XgiPtfR0aGysrLT/gJKNpyHkzgPJ3EeTuI8nOSE82BZlo4eParS0t
J+ayM9PV07d+5Ud3d3zK9lWdZp+eZMo+qDBw8qGAyqqKgoYn9RUZG2bdvWZ0xra2ufx7e2tkbdR8cla6/Xq6FDh571mJycnKT+YTyF83AS5+EkzsNJnIeTEn0e+mv6+7PS09OVnp7e7+04AQvMAACIUkFBgXw+n9ra2iL2t7W1qbi4uM+Y4uJio+P7QrIGACBKaWlpmjx5shoaGsL7QqGQGhoaVFFR0WdMRUVFxPGS9Prrr5/x+L44bhr8bPx+v+rq6s54LyFZcB5O4jycxHk4ifNwEueh/9XW1mr+/PmaMmWKpk6dqhUrVqizs1MLFiyQJM2bN09DhgxRfX29JOm+++7Tl770JT355JP6yle+omeffVbvvfeefvazn0XdpuPeugUAgNP95Cc/0RNPPKHW1lZNmjRJP/rRj1ReXi5JmjZtmkaMGKE1a9aEj3/++ef14IMPateuXbrooov0+OOPa+bMmVG3R7IGAMDhuGcNAIDDkawBAHA4kjUAAA5HsgYAwOFck6xNy5FdiB566CF5PJ6IbezYsYnuVr978803NWvWLJWWlsrj8ejFF1+M+LxlWVq2bJlKSkqUkZGhyspKffTRR4npbD8613m44447Trs+brzxxsR0tp/U19friiuuUHZ2tgoLCzV79mw1NzdHHNPV1aXq6moNGjRIWVlZuvnmm097IIXbRXMepk2bdtr1cPfddyeox4iVK5K1aTmyC9kXvvAF7du3L7y99dZbie5Sv+vs7NTEiRO1cuXKPj//+OOP60c/+pFWrVqld955RwMGDFBVVZW6urrOc0/717nOgyTdeOONEdfHb37zm/PYw/63YcMGVVdXa9OmTXr99dfV09Oj6dOnq7OzM3zM4sWL9Yc//EHPP/+8NmzYoL179+qmm25KYK/jL5rzIEkLFy6MuB4ef/zxBPUYMYu65EcCTZ061aqurg5/HAwGrdLSUqu+vj6BvTr/6urqrIkTJya6GwklKaLaTSgUsoqLi60nnngivO/IkSOW3++3fvOb3ySgh+fH58+DZVnW/Pnzra997WsJ6U+i7N+/35JkbdiwwbKsk9/71NRU6/nnnw8f88EHH1iSrI0bNyaqm/3u8+fBsizrS1/6knXfffclrlOIK8ePrLu7u7V582ZVVlaG93m9XlVWVmrjxo0J7FlifPTRRyotLdWoUaN02223affu3YnuUkLt3LlTra2tEddHbm6uysvLk/L6aGxsVGFhoS655BLdc889OnToUKK71K/a29slSfn5+ZKkzZs3q6enJ+J6GDt2rIYNG3ZBXw+fPw+n/PrXv1ZBQYHGjRunJUuW6Pjx44noHuLA8Y8btVOO7EJVXl6uNWvW6JJLLtG+ffv08MMP69prr9XWrVuVnZ2d6O4lxKkSc7GWn7sQ3Hjjjbrppps0cuRI7dixQ9/5znc0Y8YMbdy4UT6fL9Hdi7tQKKRFixbp6quv1rhx4ySdvB7S0tKUl5cXceyFfD30dR4k6dZbb9Xw4cNVWlqqLVu26IEHHlBzc7N+97vfJbC3sMvxyRr/NGPGjPD/J0yYoPLycg0fPlzPPfec7rzzzgT2DE5wyy23hP8/fvx4TZgwQaNHj1ZjY6NuuOGGBPasf1RXV2vr1q1JsW7jbM50Hu66667w/8ePH6+SkhLdcMMN2rFjh0aPHn2+u4kYOX4a3E45smSRl5eniy++WNu3b090VxLm1DXA9XG6UaNGqaCg4IK8PmpqavTyyy/rjTfe0NChQ8P7i4uL1d3drSNHjkQcf6FeD2c6D3059dzqC/F6SAaOT9Z2ypEli2PHjmnHjh0qKSlJdFcSZuTIkSouLo64Pjo6OvTOO+8k/fXxySef6NChQxfU9WFZlmpqavTCCy/oT3/6k0aOHBnx+cmTJys1NTXiemhubtbu3bsvqOvhXOehL01NTZJ0QV0PycQV0+DnKkeWLL71rW9p1qxZGj58uPbu3au6ujr5fD7NnTs30V3rV8eOHYsYDe
zcuVNNTU3Kz8/XsGHDtGjRIj322GO66KKLNHLkSC1dulSlpaWaPXt24jrdD852HvLz8/Xwww/r5ptvVnFxsXbs2KH7779fY8aMUVVVVQJ7HV/V1dVau3atXnrpJWVnZ4fvQ+fm5iojI0O5ubm68847VVtbq/z8fOXk5Ojee+9VRUWFrrzyygT3Pn7OdR527NihtWvXaubMmRo0aJC2bNmixYsX67rrrtOECRMS3HvYkujl6NH68Y9/bA0bNsxKS0uzpk6dam3atCnRXTrv5syZY5WUlFhpaWnWkCFDrDlz5ljbt29PdLf63RtvvGFJOm2bP3++ZVkn3761dOlSq6ioyPL7/dYNN9xgNTc3J7bT/eBs5+H48ePW9OnTrcGDB1upqanW8OHDrYULF1qtra2J7nZc9fX1S7KeeeaZ8DEnTpywvvnNb1oDBw60MjMzra9//evWvn37EtfpfnCu87B7927ruuuus/Lz8y2/32+NGTPG+rd/+zervb09sR2HbZTIBADA4Rx/zxoAgGRHsgYAwOFI1gAAOBzJGgAAhyNZAwDgcCRrAAAcjmQNAIDDkawBAHA4kjUAAA5HsgYAwOFI1gAAONz/BU9HCJYd1ZV8AAAAAElFTkSuQmCC\n"
          },
          "metadata": {}
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Class: Bag\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Hide outputs\n",
        "!wget https://github.com/mlc-ai/web-data/raw/main/models/fasionmnist_mlp_params.pkl"
      ],
      "metadata": {
        "id": "jMcubOsiN6vv",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "d81155c6-20ab-4eb7-a927-3d0692ed0662"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-09-18 18:54:02--  https://github.com/mlc-ai/web-data/raw/main/models/fasionmnist_mlp_params.pkl\n",
            "Resolving github.com (github.com)... 140.82.121.3\n",
            "Connecting to github.com (github.com)|140.82.121.3|:443... connected.\n",
            "HTTP request sent, awaiting response... 302 Found\n",
            "Location: https://raw.githubusercontent.com/mlc-ai/web-data/main/models/fasionmnist_mlp_params.pkl [following]\n",
            "--2023-09-18 18:54:02--  https://raw.githubusercontent.com/mlc-ai/web-data/main/models/fasionmnist_mlp_params.pkl\n",
            "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...\n",
            "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 407396 (398K) [application/octet-stream]\n",
            "Saving to: ‘fasionmnist_mlp_params.pkl’\n",
            "\n",
            "fasionmnist_mlp_par 100%[===================>] 397.85K  --.-KB/s    in 0.01s   \n",
            "\n",
            "2023-09-18 18:54:02 (30.5 MB/s) - ‘fasionmnist_mlp_params.pkl’ saved [407396/407396]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "def numpy_mlp(data, w0, b0, w1, b1):\n",
        "    lv0 = data @ w0.T + b0\n",
        "    lv1 = np.maximum(lv0, 0)\n",
        "    lv2 = lv1 @ w1.T + b1\n",
        "    return lv2"
      ],
      "metadata": {
        "id": "YV23AU8LN9Fe"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import pickle as pkl\n",
        "\n",
        "mlp_params = pkl.load(open(\"fasionmnist_mlp_params.pkl\", \"rb\"))\n",
        "res = numpy_mlp(img.reshape(1, 784),\n",
        "                mlp_params[\"w0\"],\n",
        "                mlp_params[\"b0\"],\n",
        "                mlp_params[\"w1\"],\n",
        "                mlp_params[\"b1\"])\n",
        "print(res)\n",
        "pred_kind = res.argmax(axis=1)\n",
        "print(pred_kind)\n",
        "print(\"NumPy-MLP Prediction:\", class_names[pred_kind[0]])"
      ],
      "metadata": {
        "id": "0GHkLb7KN_cX",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "4b174f5b-0fac-460d-d5a7-c6914c0cdda9"
      },
      "execution_count": null,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "[[-36.13299   -47.72502    -9.275949  -18.921837   -7.161888   -9.535042\n",
            "   -9.7998495 -30.950085   29.353281  -22.556564 ]]\n",
            "[8]\n",
            "NumPy-MLP Prediction: Bag\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "#### Optimization\n",
        "1. Graph-Level Optimization: TVM starts with optimizations at the computation graph level. It takes the imported PyTorch model and tries to simplify its computational graph. This involves fusing nodes, eliminating redundancies, and reordering operations.\n",
        "\n",
        "2. Layout Transformation: The layout of the tensors (NCHW, NHWC, etc.) can be changed to best suit the hardware. Different layouts can have a significant impact on performance due to the access patterns of the memory.\n",
        "\n",
        "3. Kernel Matching: TVM uses a library of optimized kernel implementations for common operations. It attempts to map the operations in the computational graph to these optimized kernels.\n",
        "\n",
        "4. Quantization: If applicable, this is the stage where model weights and activations may be quantized to lower precision formats like INT8 to reduce model size and increase inference speed.\n",
        "\n",
        "\n",
        "\n",
        "- ![Alt Text](https://www.researchgate.net/publication/348753417/figure/fig6/AS:984105013497856@1611640327149/An-example-of-blocks-for-scheduling-optimization-with-their-corresponding-scheduling.png)\n",
        "\n",
        "- ![Alt Text](https://www.allaboutcircuits.com/uploads/articles/qc-tech_quantization_gif-2_final.jpg)\n",
        "\n",
        "- ![Alt Text](https://www.allaboutcircuits.com/uploads/articles/neuron_connections_quantization.jpg)\n",
        "\n",
        "- ![Alt Text](https://tvm.apache.org/images/intro-auto-scheduler/search_overview.png)\n",
        "\n",
        "- ![Alt Text](https://www.nvidia.com/content/dam/en-zz/Solutions/glossary/data-science/xgboost/img-3.png)\n",
        "\n",
        "#### Partitioning\n",
        "1. Dependency Analysis: TVM analyzes the computation graph to identify dependencies between nodes. The goal is to break the graph into independent subgraphs that can be executed in parallel or loaded separately.\n",
        "\n",
        "2. Shard Formation: Once dependencies are mapped, the graph is divided into smaller segments, each forming a \"shard\". These shards are designed such that each can be loaded and executed independently of the others as much as possible.\n",
        "\n",
        "3. Metadata Creation: Metadata is generated for each shard, specifying its dependencies, the sequence in which it should be executed, and other information that will be necessary for loading and running the shard.\n",
        "\n",
        "4. Shard Optimization: After partitioning, each shard is subject to another round of optimization. This is because breaking the graph can sometimes introduce inefficiencies, like redundant operations that were previously fused.\n",
        "\n",
        "- ![Alt Text](https://media.springernature.com/m685/springer-static/image/art%3A10.1038%2Fs43588-021-00119-7/MediaObjects/43588_2021_119_Fig1_HTML.png)\n",
        "\n",
        "\n",
        "#### Code Generation\n",
        "\n",
        "1. Backend Selection: TVM allows you to specify the target backend, such as Vulkan for Android or Metal for iOS. Depending on this choice, different code generation pathways may be used.\n",
        "\n",
        "2. Lowering: The high-level operations in each shard are transformed into low-level, hardware-specific instructions. This is often referred to as \"lowering\" the computation graph.\n",
        "\n",
        "3. Compilation: Once the operations have been lowered, they are compiled into machine code optimized for the target architecture. This code is what will actually be run on the mobile device.\n",
        "\n",
        "4. Packaging: Finally, the compiled machine code for each shard is packaged into a binary file format (.so for Android, .dylib for iOS, etc.). These binary files are now ready for deployment."
      ],
      "metadata": {
        "id": "hYbfG8J6Wf87"
      }
    },
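    {
      "cell_type": "markdown",
      "source": [
        "To make the node-fusion idea from the graph-level optimization step concrete, here is a hedged NumPy sketch. `unfused` and `fused` are hypothetical helper names, the shapes are chosen for illustration, and NumPy is not a compiler, so this only mimics the effect of fusion by reusing one output buffer instead of materializing each intermediate:\n",
        "\n",
        "```python\n",
        "import numpy as np\n",
        "\n",
        "def unfused(x, w, b):\n",
        "    # Three separate ops, each producing its own intermediate array.\n",
        "    y = x @ w.T\n",
        "    z = y + b\n",
        "    return np.maximum(z, 0)\n",
        "\n",
        "def fused(x, w, b):\n",
        "    # Bias add and ReLU are applied in place on the matmul result.\n",
        "    out = x @ w.T\n",
        "    np.add(out, b, out=out)\n",
        "    np.maximum(out, 0, out=out)\n",
        "    return out\n",
        "\n",
        "rng = np.random.default_rng(0)\n",
        "x = rng.standard_normal((1, 784)).astype(np.float32)\n",
        "w = rng.standard_normal((128, 784)).astype(np.float32)\n",
        "b = rng.standard_normal(128).astype(np.float32)\n",
        "same = np.allclose(unfused(x, w, b), fused(x, w, b))\n",
        "```\n",
        "\n",
        "Both functions compute the same values; the fused version just avoids two extra intermediate arrays, which is the kind of saving a compiler aims for when it merges adjacent operators."
      ],
      "metadata": {}
    },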
    {
      "cell_type": "code",
      "source": [
        "from tvm.script import ir as I\n",
        "from tvm.script import tir as T\n",
        "from tvm.script import relax as R\n",
        "#seperate computation and allocation\n",
        "\n",
        "@I.ir_module\n",
        "class Module:\n",
        "    @T.prim_func\n",
        "    def linear0(x: T.handle, w: T.handle, b: T.handle, z: T.handle):\n",
        "        m = T.int64()\n",
        "        X = T.match_buffer(x, (1, m))\n",
        "        n = T.int64()\n",
        "        W = T.match_buffer(w, (n, m))\n",
        "        B = T.match_buffer(b, (n,))\n",
        "        Z = T.match_buffer(z, (1, n))\n",
        "        # with T.block(\"root\"):\n",
        "        Y = T.alloc_buffer((1, n))\n",
        "        for i, j, k in T.grid(1, n, m):\n",
        "            with T.block(\"Y\"):\n",
        "                vi, vj, vk = T.axis.remap(\"SSR\", [i, j, k])\n",
        "                T.reads(X[vi, vk], W[vj, vk])\n",
        "                T.writes(Y[vi, vj])\n",
        "                with T.init():\n",
        "                    Y[vi, vj] = T.float32(0)\n",
        "                Y[vi, vj] = Y[vi, vj] + X[vi, vk] * W[vj, vk]\n",
        "        for i, j in T.grid(1, n):\n",
        "            with T.block(\"Z\"):\n",
        "                vi, vj = T.axis.remap(\"SS\", [i, j])\n",
        "                T.reads(Y[vi, vj], B[vj])\n",
        "                T.writes(Z[vi, vj])\n",
        "                Z[vi, vj] = Y[vi, vj] + B[vj]\n",
        "\n",
        "    @T.prim_func\n",
        "    def relu0(x: T.handle, y: T.handle):\n",
        "        n = T.int64()\n",
        "        X = T.match_buffer(x, (1, n))\n",
        "        Y = T.match_buffer(y, (1, n))\n",
        "        # with T.block(\"root\"):\n",
        "        for i, j in T.grid(1, n):\n",
        "            with T.block(\"Y\"):\n",
        "                vi, vj = T.axis.remap(\"SS\", [i, j])\n",
        "                T.reads(X[vi, vj])\n",
        "                T.writes(Y[vi, vj])\n",
        "                Y[vi, vj] = T.max(X[vi, vj], T.float32(0))\n",
        "\n",
        "    @R.function\n",
        "    def main(x: R.Tensor((1, \"m\"), dtype=\"float32\"), w0: R.Tensor((\"n\", \"m\"), dtype=\"float32\"), b0: R.Tensor((\"n\",), dtype=\"float32\"), w1: R.Tensor((\"k\", \"n\"), dtype=\"float32\"), b1: R.Tensor((\"k\",), dtype=\"float32\")) -> R.Tensor((1, \"k\"), dtype=\"float32\"):\n",
        "        k = T.int64()\n",
        "        m = T.int64()\n",
        "        n = T.int64()\n",
        "        with R.dataflow():\n",
        "            lv0 = R.call_dps_packed(\"linear0\", (x, w0, b0), out_sinfo=R.Tensor((1, n), dtype=\"float32\"))\n",
        "            lv1 = R.call_dps_packed(\"relu0\", (lv0,), out_sinfo=R.Tensor((1, n), dtype=\"float32\"))\n",
        "            out = R.call_dps_packed(\"linear0\", (lv1, w1, b1), out_sinfo=R.Tensor((1, k), dtype=\"float32\"))\n",
        "            R.output(out)\n",
        "        return out"
      ],
      "metadata": {
        "id": "PTykold3OB0c"
      },
      "execution_count": null,
      "outputs": []
    },
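    {
      "cell_type": "markdown",
      "source": [
        "To make the TensorIR above easier to read, here is a plain NumPy reference for the same computation. It is a sketch for checking understanding, not something TVM generates: the helper names (`linear0_ref`, `relu0_ref`, `mlp_ref`) and the small random shapes are assumptions chosen for illustration. The loops in `linear0_ref` follow the same `j`, `k` iteration as the `Y` block in `linear0`, and the last lines compare the result against the vectorized form `x @ w.T + b`."
      ],
      "metadata": {}
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "\n",
        "def linear0_ref(x, w, b):\n",
        "    # Mirrors linear0: Y[0, j] = sum_k X[0, k] * W[j, k], then Z = Y + B.\n",
        "    n, m = w.shape\n",
        "    y = np.zeros((1, n), dtype=np.float32)\n",
        "    for j in range(n):\n",
        "        for k in range(m):\n",
        "            y[0, j] += x[0, k] * w[j, k]\n",
        "    return y + b\n",
        "\n",
        "def relu0_ref(x):\n",
        "    # Mirrors relu0: elementwise max(x, 0).\n",
        "    return np.maximum(x, np.float32(0))\n",
        "\n",
        "def mlp_ref(x, w0, b0, w1, b1):\n",
        "    # Mirrors main: linear0 -> relu0 -> linear0.\n",
        "    lv0 = linear0_ref(x, w0, b0)\n",
        "    lv1 = relu0_ref(lv0)\n",
        "    return linear0_ref(lv1, w1, b1)\n",
        "\n",
        "# Small random inputs (hypothetical shapes: m=8, n=16, k=4).\n",
        "ref_rng = np.random.default_rng(0)\n",
        "x_demo = ref_rng.standard_normal((1, 8)).astype(np.float32)\n",
        "w0_demo = ref_rng.standard_normal((16, 8)).astype(np.float32)\n",
        "b0_demo = ref_rng.standard_normal((16,)).astype(np.float32)\n",
        "w1_demo = ref_rng.standard_normal((4, 16)).astype(np.float32)\n",
        "b1_demo = ref_rng.standard_normal((4,)).astype(np.float32)\n",
        "\n",
        "out_demo = mlp_ref(x_demo, w0_demo, b0_demo, w1_demo, b1_demo)\n",
        "expected_demo = np.maximum(x_demo @ w0_demo.T + b0_demo, 0) @ w1_demo.T + b1_demo\n",
        "print(out_demo.shape, np.allclose(out_demo, expected_demo, atol=1e-4))"
      ],
      "metadata": {},
      "execution_count": null,
      "outputs": []
    },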
    {
      "cell_type": "code",
      "source": [
        "ex = relax.build(MyModule, target=\"llvm\")\n",
        "type(ex)"
      ],
      "metadata": {
        "id": "eEogZGA3ODyH"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "vm = relax.VirtualMachine(ex, tvm.cpu())"
      ],
      "metadata": {
        "id": "B8RDSdUfOGrz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "data_nd = tvm.nd.array(img.reshape(1, 784))\n",
        "nd_params = {k: tvm.nd.array(v) for k, v in mlp_params.items()}"
      ],
      "metadata": {
        "id": "fEU2MCu9OIdW"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "nd_res = vm[\"main\"](data_nd,\n",
        "                    nd_params[\"w0\"],\n",
        "                    nd_params[\"b0\"],\n",
        "                    nd_params[\"w1\"],\n",
        "                    nd_params[\"b1\"])\n",
        "print(nd_res)"
      ],
      "metadata": {
        "id": "czq7d0cJOJ1W"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "pred_kind = np.argmax(nd_res.numpy(), axis=1)\n",
        "print(\"MyModule Prediction:\", class_names[pred_kind[0]])"
      ],
      "metadata": {
        "id": "fyEyr11yOLeC"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 6 - Compile a Large Language Model from HuggingFace to iOS & Android with TVM\n",
        "- MLC LLM https://mlc.ai/mlc-llm/docs/install/tvm.html"
      ],
      "metadata": {
        "id": "OBFI-dOI4TMH"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Step 7 - Future Directions"
      ],
      "metadata": {
        "id": "MJHNTRsL4YtJ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "\n",
        "\n",
        "## Dr Dignity Next Steps\n",
        "\n",
        "1. Port RWKV 3B to an Android APK\n",
        "2. Fine-tune Marx_3B on medical data to pass the US Medical Licensing Exam (i'm calling it MedMarx_3b)\n",
        "3. Port MedMarx_3b to an Android APK\n",
        "4. Fine-Tune MedMarx_3B on code dataset (CodeMedMarx_3B)\n",
        "5. Port CodeMedMarx_3B to an Android APK as a Code Interpreter\n",
        "6. Use the Code Interpreter as a retrieval system on a local medical datastore to cite sources\n",
        "7. Fine-Tune CodeMedMarx_3B on biomedical image dataset for multimodal QA\n",
        "8. Continue Fine-tuning on all modalities of biomedical data\n",
        "9. Port MultiModalCodeMedMarx_3B to an Android APK\n",
        "10. Build Local Video Avatar Interface for MultiModalCodeMedMarx_3B\n",
        "\n",
        "The goal is to continually improve Dr Dignity's response accuracy until it replaces the need for human Doctors.\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "### Summary:\n",
        "\n",
        "Option 1: Learn Vulkan for Android and Learn Metal for iOS\n",
        "- Implement Android GPU Inference code in Vulkan\n",
        "- Implement iOS GPU Inference code in Metal\n",
        "\n",
        "(Better) Option 2: Learn Tensor Virtual Machine for iOS and Android\n",
        "- Implement Cross Platform GPU inference code in Relay\n",
        "- Optimize Relay code for Android or iOS\n",
        "- Compile Relay code to Android or iOS\n",
        "\n",
        "\n",
        "\n"
      ],
      "metadata": {
        "id": "2GbMBszSxex9"
      }
    }
  ]
}